ETAPS Foreword


Welcome to the 27th ETAPS! ETAPS 2024 took place in Luxembourg City, the 
beautiful capital of Luxembourg. 

ETAPS 2024 is the 27th instance of the European Joint Conferences on Theory and 
Practice of Software. ETAPS is an annual federated conference established in 1998, 
and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each con- 
ference has its own Program Committee (PC) and its own Steering Committee (SC). 
The conferences cover various aspects of software systems, ranging from theoretical 
computer science to foundations of programming languages, analysis tools, and formal 
approaches to software engineering. Organising these conferences in a coherent, highly 
synchronized conference programme enables researchers to participate in an exciting 
event, having the possibility to meet many colleagues working in different directions in 
the field, and to easily attend talks of different conferences. On the weekend before the 
main conference, numerous satellite workshops took place that attracted many 
researchers from all over the globe. 

ETAPS 2024 received 352 submissions in total, 117 of which were accepted, 
yielding an overall acceptance rate of 33%. I thank all the authors for their interest in 
ETAPS, all the reviewers for their reviewing efforts, the PC members for their con- 
tributions, and in particular the PC (co-)chairs for their hard work in running this entire 
intensive process. Last but not least, my congratulations to all authors of the accepted 
papers! 

ETAPS 2024 featured the unifying invited speakers Sandrine Blazy (University of 
Rennes, France) and Lars Birkedal (Aarhus University, Denmark), and the invited 
speakers Ruzica Piskac (Yale University, USA) for TACAS and Jérôme Leroux 
(Laboratoire Bordelais de Recherche en Informatique, France) for FoSSaCS. Invited 
tutorials were provided by Tamar Sharon (Radboud University, the Netherlands) on 
computer ethics and David Monniaux (Verimag, France) on abstract interpretation. 

As part of the programme we had the first ETAPS industry day. The goal of this day 
was to bring industrial practitioners into the heart of the research community and to 
catalyze the interaction between industry and academia. The day was organized by 
Nikolai Kosmatov (Thales Research and Technology, France) and Andrzej Wasowski 
(IT University of Copenhagen, Denmark). 

ETAPS 2024 was organized by the SnT - Interdisciplinary Centre for Security, 
Reliability and Trust, University of Luxembourg. The University of Luxembourg was 
founded in 2003. The university is one of the best and most international young 
universities with 6,000 students from 130 countries and 1,500 academics from all over 
the globe. The local organisation team consisted of Peter Y.A. Ryan (general chair), 
Peter B. Roenne (organisation chair), Maxime Cordy and Renzo Gaston Degiovanni 
(workshop chairs), Magali Martin and Isana Nascimento (event manager), Marjan 
Skrobot (publicity chair), and Afonso Arriaga (local proceedings chair). This team also
organised the online edition of ETAPS 2021, and now we are happy that they agreed to
also organise a physical edition of ETAPS. 

ETAPS 2024 is further supported by the following associations and societies: 
ETAPS e.V., EATCS (European Association for Theoretical Computer Science), 
EAPLS (European Association for Programming Languages and Systems), and EASST 
(European Association of Software Science and Technology). 

The ETAPS Steering Committee consists of an Executive Board, and representa- 
tives of the individual ETAPS conferences, as well as representatives of EATCS, 
EAPLS, and EASST. The Executive Board consists of Marieke Huisman (Twente, 
chair), Andrzej Wasowski (Copenhagen), Thomas Noll (Aachen), Jan Kofron (Prague), 
Barbara König (Duisburg), Arnd Hartmanns (Twente), Caterina Urban (Inria), Jan 
Křetínský (Munich), Elizabeth Polgreen (Edinburgh), and Lenore Zuck (Chicago). 

Other members of the steering committee are: Maurice ter Beek (Pisa), Dirk Beyer 
(Munich), Artur Boronat (Leicester), Luis Caires (Lisboa), Ana Cavalcanti (York), 
Ferruccio Damiani (Torino), Bernd Finkbeiner (Saarland), Gordon Fraser (Passau), 
Arie Gurfinkel (Waterloo), Reiner Hähnle (Darmstadt), Reiko Heckel (Leicester),
Marijn Heule (Pittsburgh), Joost-Pieter Katoen (Aachen and Twente), Delia Kesner 
(Paris), Naoki Kobayashi (Tokyo), Fabrice Kordon (Paris), Laura Kovacs (Vienna), 
Mark Lawford (Hamilton), Tiziana Margaria (Limerick), Claudio Menghi (Hamilton 
and Bergamo), Andrzej Murawski (Oxford), Laure Petrucci (Paris), Peter Y.A. Ryan 
(Luxembourg), Don Sannella (Edinburgh), Viktor Vafeiadis (Kaiserslautern), Stepha- 
nie Weirich (Pennsylvania), Anton Wijs (Eindhoven), and James Worrell (Oxford). 

I would like to take this opportunity to thank all authors, keynote speakers, atten- 
dees, organizers of the satellite workshops, and Springer Nature for their support. 
ETAPS 2024 was also generously supported by a RESCOM grant from the Luxem- 
bourg National Research Foundation (project 18015543). I hope you all enjoyed 
ETAPS 2024. 

Finally, a big thanks to both Peters, Magali and Isana and their local organization 
team for all their enormous efforts to make ETAPS a fantastic event. 


April 2024 Marieke Huisman 
ETAPS SC Chair 
ETAPS e.V. President 


Preface 


FASE 2024 is the 27th edition of the International Conference on Fundamental 
Approaches to Software Engineering conference series. It is a forum for researchers, 
developers, and users interested in the broad field of software engineering. The topics 
of interest include requirements, design, architecture, modeling, applications of AI to 
software engineering and software engineering for AI-based systems, quality, model- 
driven engineering, processes, and software evolution. FASE 2024 was part of the 27th 
federation of European Joint Conferences on Theory and Practice of Software (ETAPS 
2024), held on April 6-11 in Luxembourg. 
There were four submission categories for FASE: 


1. Research papers clearly identify and justify a principled advance to the
fundamentals of software engineering.

2. Empirical-evaluation papers evaluate existing software challenges or critically 
validate current proposed solutions with scientific means, that is, by empirical 
studies, controlled experiments, rigorous case studies, and simulations. 

3. New Ideas and Emerging Results (NIER) papers seek to disrupt the status quo with 
forward-looking, thought-provoking, innovative research on the foundations of 
software engineering, as well as lessons learned from the past. 

4. Tool demonstration papers present a new tool, a new tool component, or novel
extensions to an existing tool.


This year, 41 papers were submitted to FASE in categories 1-4, consisting of 29
research papers, 2 empirical-evaluation papers, 8 NIER papers, and 2 tool-demon- 
stration papers. Each paper was reviewed by three program-committee members, who 
could make use of subreviewers. It was possible to submit an artifact for evaluation 
alongside a paper, if made long-term available and declared in the Data-Availability 
Statement. The program committee extensively discussed the papers and ultimately 
decided to accept 14 papers included here. This is an acceptance rate of 34%. 

Artifacts comprise tools, models, proofs, or other data for validating the results of a 
paper. The artifact-evaluation committee (AEC) reviewed the artifacts based on their 
documentation, ease of use, and, most importantly, whether the results presented in the 
corresponding paper could be accurately reproduced. 

In an endeavor to unify artifact evaluation (AE) processes across ETAPS confer- 
ences, the FASE 2024 AEC joined forces with the ESOP and FoSSaCS AECs. Across 
all three conferences, AEC members were recruited by direct nominations from PC 
members or the AEC chairs. 

The joint call for artifacts imposed few requirements on the artifact packaging; in 
particular, there was no predefined environment in which submitted artifacts were 
supposed to be executable. Instead, author-defined container and VM submissions were 
strongly encouraged and this advice was followed by most authors. We also chose to 
adopt a documentation standard. This greatly facilitated artifact reviews, and we
believe that it will equally facilitate future use of the artifacts. 

AEC members from all three committees bid to review artifacts submitted by all the 
conferences. This gave the AEC flexibility to accommodate varying submission 
numbers or topics of artifacts from the conferences. The evaluation was conducted in
three phases: an initial “kick-the-tires” phase and author response, a main review phase,
and a discussion phase. FASE 2024 received 6 artifact submissions. All of them met 
the requirements for the “Artifacts Available” badge. In addition, 4 submissions were 
awarded the “Artifacts Evaluated — Functional” badge and 2 submissions the “Artifacts 
Evaluated — Reusable” badge. 

FASE 2024 hosted the ETAPS unifying keynote by Sandrine Blazy from the 
University of Rennes, France. These proceedings contain the invited paper supporting 
the keynote. In From Mechanized Semantics to Verified Compilation: The Clight 
Semantics of CompCert, Blazy reports on the use of operational semantics in the very 
successful CompCert project based on the Coq theorem prover. 

FASE 2024 also hosted Test-Comp 2024, the 6th International Competition on 
Software Testing. This event evaluated 20 software systems for automatic test-case 
generation for C programs. From the 14 actively participating teams, the jury selected 5 
short papers that describe their test systems. These papers are also published in these 
proceedings. They were reviewed by a separate program committee (jury). Each of the 
Test-Comp papers was assessed by at least four jury members. Two sessions in the 
FASE program were reserved for the presentation of the results: (1) a presentation 
session with a report by the competition chair and summaries by the developer teams, 
and (2) an open community meeting. 

Finally, we would like to thank all the people who helped to make FASE 2024 
successful. First, we thank the authors for submitting their papers. The PC members 
and additional reviewers did a great job: they contributed informed and detailed reports 
and engaged in the PC discussions. We thank Jan Kofron and Sebastian Junges for their 
support in our use of HotCRP for artifact evaluation. We thank Reiner Hähnle, chair 
of the FASE steering committee, and Marieke Huisman, chair of the ETAPS steering 
committee, for their valuable advice. Lastly, we would like to thank the overall 
organization team of ETAPS 2024. 
From Mechanized Semantics to Verified
Compilation: the Clight Semantics of CompCert 


Sandrine Blazy


Inria, Univ Rennes, CNRS, IRISA, Rennes, France 


sandrine.blazy@irisa.fr 


Abstract. CompCert is a formally verified compiler for C that is spec- 
ified, programmed and proved correct with the Coq proof assistant. 
CompCert was used in industry to compile critical embedded software. 
Its correctness proof states that the compiler does not introduce bugs. 
This semantic preservation property involves the formal semantics of the 
source and target languages of the compiler. 

Reasoning on C semantics to prove compiler correctness is challenging, 
as C is a real language that was not designed with semantics in mind. 
This paper presents the operational style that was designed for the C 
semantics of CompCert in order to facilitate the mechanized reasoning 
on terminating and diverging programs, and details the semantics of the 
Clight source language of CompCert. 


Keywords: operational semantics of programming languages · verified
compilation - machine-checked proofs 


1 Introduction 


Deductive verification provides very strong mathematical guarantees that a piece 
of software is correct with respect to its specification, written in a logical lan- 
guage to avoid ambiguities. A proof is conducted to provide these guarantees. 
The outcome of deductive verification is verified software, consisting of an
implementation and a proof that can be replayed or given to a certification au- 
thority for scrutiny. This proof requires reasoning on properties related to the 
involved programming language; they become mathematically precise as soon 
as this language has formal semantics. Defining and reasoning on realistic lan- 
guages requires mechanized semantics and machine-checked proofs, ensuring that 
the proof is complete and that no semantic rule has been forgotten. 

There are mainly two families of deductive proof tools (also known as program 
provers), each with its pros and cons: automatic tools (such as Dafny [22], F* [30] 
or Why3 [15]) where formulas (expressing pre- and post-conditions and invari- 
ants) are discharged to logic solvers, and interactive proof assistants (e.g., Coq [17], 
Isabelle [2] or Agda [1]) where the user decides how to reason and conducts the 
proof interactively with the tool, which automates part of the reasoning and ensures
that the proof is complete and follows the laws of mathematical logic. Automatic
program provers are easier to use when the discharged formulas are proved with- 
out requiring extra work (namely adding assertions to help the logic solvers). 


© The Author(s) 2024 
However, when program provers fail to prove some formulas, interactive proof
assistants are better adapted to conduct more advanced proofs. A prototypical 
example is a proof requiring reasoning on a data structure that is not used by 
the software under scrutiny, but only defined for the sole purpose of the proof 
(see for instance the proof of correctness of the famous majority algorithm [12]). 


One of the first programs whose proof was mechanized in LCF is a rudi- 
mentary compiler for arithmetic expressions [31]. In 1972, when this paper was 
published, a compiler was a representative example of a particularly complex 
program. The specification of a compiler is rather simple: the generated code 
must behave as prescribed by the semantics of the source program. This correct- 
ness property is a semantic preservation property from the source language to 
the target language of the compiler. It becomes mathematically precise as soon 
as these languages are defined by formal semantics. 


Nowadays, the compiler remains a particularly complex piece of software (due 
to the numerous optimizations it performs to generate efficient code). Moreover, 
it is the mandatory point of passage in the software production chain. Verifying 
the compiler provides a means of ensuring that no errors are introduced during 
compilation, and of preserving at target level the guarantees obtained at source 
level. The idea of having a single theorem demonstrated once and for all, along 
with a readable proof, was already present in 1972, but it took several decades 
for verified compilation to develop and scale up. 


CompCert is the first optimizing compiler for the C language that targets several
assembly languages, is used in safety-critical industries (to compile mission-critical
embedded software used in avionics and nuclear power), and comes with a mechanized
proof of correctness [23, 27, 19]. In industry, the interest in CompCert
arose from a need to improve the performance of the generated code while
meeting the traceability requirements imposed by the certification authorities
of these critical fields; CompCert has indeed met both needs.


Developing a verified compiler requires both programming the compiler using 
the programming language of the proof assistant (so that it runs efficiently on real 
programs), and defining a semantic model and abstractions to reason about, in 
order to conduct the correctness proof. Mechanized reasoning on C-like languages 
is tricky; it requires a semantic style that is adapted to inductive reasoning and 
some associated reasoning principles. In CompCert, the chosen proof technique 
is the use of simulation diagrams between program executions, which required
defining a new semantic model that is detailed in this paper. The semantic
model and proof technique scale to realistic languages like C. They are general 
enough to be applied to all the intermediate languages of the compiler. The 
proof technique was extended and successfully reused in order to ensure other 
properties than CompCert correctness [5-7]. 


This paper is about mechanized operational semantics for compiler verification
and their application to the CompCert compiler, with a focus on the
Clight semantics, which has evolved significantly since its first published version [9].
The Clight language is the preferred language to get guarantees from C programs
and then compile them with CompCert (e.g., [18, 13, 11, 8, 21, 16, 33]).
This paper aims at providing the prerequisites needed to design new program
transformations or analyses operating over Clight.

All results presented in this paper have been mechanically verified using the 
Coq proof assistant [25, 32, 3]. This paper is organized as follows. First, Section 2
recalls the early days of compiler verification. Then, Section 3 introduces
a small-step semantics for terminating programs written using a toy imperative 
language, together with the associated proof technique based on simulation di- 
agrams. Section 4 extends this language and its semantics to observe diverging 
program executions; it defines an alternate semantics that facilitates the mecha- 
nized proofs. Section 5 defines the semantics of Clight. Related work is discussed 
in Section 6, followed by conclusions. 


Notations. For functions returning “option” types, $\lfloor x \rfloor$ (read: “some x”) corresponds
to success with return value x, and $\emptyset$ (read: “none”) corresponds to failure.
In grammars and rules, $a^*$ denotes 0, 1 or several occurrences of syntactic category
a, and $a^?$ denotes an optional occurrence of syntactic category a. $\epsilon$ denotes
the empty list, [x] denotes a list made of a single element x and h :: t denotes the
list with head h and tail t. The list l ++ l′ denotes the concatenation of two lists l
and l′. Given a binary relation R, $R^*$ denotes its reflexive transitive closure and
$R^+$ its transitive closure.


2 Historical Example: a First Verified Compiler 


The idea of verifying a compiler and stating a theorem for compiler correctness 
dates back to 1967 [29]. The proof of this theorem was mechanized in 1972 
using LCF [31]. This compiler translates in a single pass any simple arithmetic 
expression a to a code p, namely a list of instructions of a simple stack machine 
(see Fig. 1); this is the familiar translation to reverse Polish notation used by 
old HP pocket calculators. 

For instance, the expression 1+2 is compiled to the code iconst 1 :: iconst 2 ::
iplus :: $\epsilon$. The stack contains numbers and the machine instructions pop their
arguments off the stack and push their results back. This machine is close to a
subset of the Java virtual machine. The machine code for an expression a executes
in sequence, and deposits the value of a at the top of the stack $\pi$. An
instruction either pushes an integer, or pushes the current value of a variable,
or pops two integers then pushes their sum.

The source and target languages are defined in Fig. 1 by their semantics. 
In [29], these are functions interpreting expressions or instructions. In this paper, 
we rather use inference rules to abstract away the definitions of all our semantics. 
The semantic judgments for evaluating expression a and executing code p are
respectively $\sigma \vdash a \Rightarrow v$ and $\sigma, \pi \vdash p \rightarrow \pi'$, where the store
$\sigma$ assigns integer values to variables, and the evaluation stack $\pi$
contains temporary integer values.

The correctness theorem of the compiler is Theorem 2: it states that for 
any expression a, its value v computed by the semantics of the source language 
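Arithmetic expressions:
$a ::= x \mid c \mid a + a$    source language (variable, integer constant, addition)

CONSTANT: $\sigma \vdash c \Rightarrow c$

VARIABLE: $\sigma \vdash x \Rightarrow \sigma(x)$

ADDITION: $\dfrac{\sigma \vdash a_1 \Rightarrow v_1 \qquad \sigma \vdash a_2 \Rightarrow v_2}{\sigma \vdash a_1 + a_2 \Rightarrow v_1 + v_2}$

VM instructions:
$i ::= \mathtt{ivar}\ x \mid \mathtt{iconst}\ c \mid \mathtt{iplus}$    target language

EMPTY STACK: $\sigma, \pi \vdash \epsilon \rightarrow \pi$

CONSTANT: $\dfrac{\sigma, c :: \pi \vdash p \rightarrow \pi'}{\sigma, \pi \vdash \mathtt{iconst}\ c :: p \rightarrow \pi'}$

VARIABLE: $\dfrac{\sigma, \sigma(x) :: \pi \vdash p \rightarrow \pi'}{\sigma, \pi \vdash \mathtt{ivar}\ x :: p \rightarrow \pi'}$

ADDITION: $\dfrac{\sigma, (m + n) :: \pi \vdash p \rightarrow \pi'}{\sigma, n :: m :: \pi \vdash \mathtt{iplus} :: p \rightarrow \pi'}$

OTHER: $\dfrac{\sigma, \pi \vdash p \rightarrow \pi'}{\sigma, \pi \vdash i :: p \rightarrow \pi'}$

Translation from arithmetic expressions to machine code (compile function):

$x \mapsto \mathtt{ivar}\ x \qquad c \mapsto \mathtt{iconst}\ c \qquad \dfrac{a_1 \mapsto i_1 \qquad a_2 \mapsto i_2}{a_1 + a_2 \mapsto i_1 +\!\!+\ i_2 +\!\!+\ [\mathtt{iplus}]}$

Theorem 1 (first correctness). $\forall a\, \sigma\, v\, \pi,\ \sigma \vdash a \Rightarrow v \implies \sigma, \pi \vdash \mathrm{compile}(a) \rightarrow v :: \pi$

Proof. By induction on the structure of arithmetic expressions.

Theorem 2 (compiler correct). $\forall a\, \sigma\, v,\ \sigma \vdash a \Rightarrow v \implies \sigma, \epsilon \vdash \mathrm{compile}(a) \rightarrow [v]$

Proof. By theorem 1.

Fig. 1: Historical example: a first verified compiler.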


is exactly the value returned by executing the compiled code compile(a). This 
theorem is proved only once, for any expression given as input to the compiler. 
The verification of this tiny compiler is now taught as an exercise in master's
courses (e.g., [25, 32]). It is an illustrative example of the need to generalize
a theorem, so that it can be proved by induction (here on expressions). This 
explains why Theorem 1 is proved by induction on expressions and used to 
prove Theorem 2, the main theorem for compiler correctness. 
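
The compiler and the stack machine of this section can be sketched as an executable program. The following Python fragment is only an illustration and not the LCF or Coq development: the class names `Var`, `Const`, `Add` and the functions `eval_expr`, `compile_expr` and `run` are hypothetical, the stack is modelled as a tuple whose first element is the top, and the OTHER rule of Fig. 1 is not modelled. Testing on one expression does not replace the proof by induction; it only shows what Theorems 1 and 2 state.

```python
from dataclasses import dataclass

# Source language: a ::= x | c | a + a (names below are illustrative).
@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Const:
    value: int

@dataclass(frozen=True)
class Add:
    left: object
    right: object

def eval_expr(store, a):
    """Big-step evaluation of an expression in a store (a dict)."""
    if isinstance(a, Const):
        return a.value
    if isinstance(a, Var):
        return store[a.name]
    return eval_expr(store, a.left) + eval_expr(store, a.right)

def compile_expr(a):
    """Translate an expression to stack-machine code in reverse Polish order."""
    if isinstance(a, Const):
        return [("iconst", a.value)]
    if isinstance(a, Var):
        return [("ivar", a.name)]
    return compile_expr(a.left) + compile_expr(a.right) + [("iplus",)]

def run(store, code, stack=()):
    """Execute code on a stack; the top of the stack is the first element."""
    for instr in code:
        if instr[0] == "iconst":
            stack = (instr[1],) + stack
        elif instr[0] == "ivar":
            stack = (store[instr[1]],) + stack
        else:  # iplus: pop n then m, push m + n
            n, m, *rest = stack
            stack = (m + n,) + tuple(rest)
    return stack

store = {"x": 40}
expr = Add(Var("x"), Add(Const(1), Const(1)))
# Theorem 2 on this instance: starting from the empty stack, the result is [v].
assert run(store, compile_expr(expr)) == (eval_expr(store, expr),)
# Theorem 1 on this instance: an arbitrary initial stack is left untouched below v.
assert run(store, compile_expr(expr), (7,)) == (eval_expr(store, expr), 7)
```

The second assertion is the generalized statement: it quantifies over the initial stack, which is what makes the induction on expressions go through for the two operands of an addition.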


3 A First Semantics for a Toy Imperative Language 


The previous section defines a big-step semantics for a rudimentary language 
for arithmetic expressions. In this section, we first extend this language (into a 
toy imperative language called IMP), and then introduce simulation diagrams, 
a convenient proof technique for reasoning on IMP programs. 
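Boolean expressions:
$b ::= \mathtt{true} \mid \mathtt{false} \mid a = a \mid a < a \mid \neg b \mid b \wedge b$    source language

IMP commands:
$c ::= \mathtt{skip} \mid x := a \mid c; c$    skip, assignment, sequence
$\quad \mid \mathtt{if}\ (b)\ c\ \mathtt{else}\ c \mid \mathtt{while}\ (b)\ c$    conditional, while loop

EQUALITY TEST: $\dfrac{\sigma \vdash a_1 \Rightarrow v_1 \qquad \sigma \vdash a_2 \Rightarrow v_2}{\sigma \vdash a_1 = a_2 \Rightarrow v_1 = v_2}$

NEGATION: $\dfrac{\sigma \vdash b \Rightarrow v}{\sigma \vdash \neg b \Rightarrow \neg v}$

AND: $\dfrac{\sigma \vdash b_1 \Rightarrow v_1 \qquad \sigma \vdash b_2 \Rightarrow v_2}{\sigma \vdash b_1 \wedge b_2 \Rightarrow v_1 \wedge v_2}$

ASSIGN: $\dfrac{\sigma \vdash a \Rightarrow v}{(x := a, \sigma) \rightarrow (\mathtt{skip}, \sigma[x \mapsto v])}$

IF TRUE: $\dfrac{\sigma \vdash b \Rightarrow \mathtt{true}}{(\mathtt{if}\ (b)\ c_1\ \mathtt{else}\ c_2, \sigma) \rightarrow (c_1, \sigma)}$

IF FALSE: $\dfrac{\sigma \vdash b \Rightarrow \mathtt{false}}{(\mathtt{if}\ (b)\ c_1\ \mathtt{else}\ c_2, \sigma) \rightarrow (c_2, \sigma)}$

SEQUENCE DONE: $(\mathtt{skip}; c, \sigma) \rightarrow (c, \sigma)$

SEQUENCE: $\dfrac{(c_1, \sigma_1) \rightarrow (c_2, \sigma_2)}{(c_1; c, \sigma_1) \rightarrow (c_2; c, \sigma_2)}$

WHILE DONE: $\dfrac{\sigma \vdash b \Rightarrow \mathtt{false}}{(\mathtt{while}\ (b)\ c, \sigma) \rightarrow (\mathtt{skip}, \sigma)}$

WHILE LOOP: $\dfrac{\sigma \vdash b \Rightarrow \mathtt{true}}{(\mathtt{while}\ (b)\ c, \sigma) \rightarrow (c; (\mathtt{while}\ (b)\ c), \sigma)}$

Fig. 2: IMP operational semantics: big-step semantics for expressions, and small-
step semantics for commands.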


3.1 Small-step Semantics 


IMP is made of arithmetic expressions (reused from Section 2), boolean expres- 
sions and commands (skip, assignment, sequence, conditional and loop). Boolean 
expressions are used in conditionals and loops. IMP is defined in Fig. 2, where 
the semantics of arithmetic expressions defined in Fig. 1 is reused. 

Semantics observe the possible behaviors of programs and are defined using 
an operational style, that is the preferred style for machine-checked reasoning 
about semantics. Operational semantics consist of big-step semantics and small- 
step semantics, and both styles are equivalent. Moreover, proving this equiva- 
lence is a valuable way of getting confidence in the semantics and supporting 
both styles may be interesting, as it offers the possibility of choosing the most 
appropriate one for different needs. 

Choosing a style may be a matter of taste. However, big-step semantics are 
not adapted to define in a natural way some semantic features such as unstruc- 
tured control, diverging and concurrent executions, whereas small-step semantics 
are more suitable. Because of while loops (e.g., while (true) skip), the execution 
of IMP programs may diverge, contrary to the evaluation of IMP expressions. 
So, we rather choose small-step semantics to define IMP commands, and big-step
semantics to define IMP expressions. 

The small-step semantics is a reduction semantics between semantic states. 
A semantic state is a pair $(c, \sigma)$ made of a command and a store. The semantics
takes the form of a relation $(c, \sigma) \rightarrow (c', \sigma')$, where a command c is reduced
into a command c′ in an execution step. The c′ command represents all the
remaining steps and $\sigma'$ is the store resulting from this computation step. The
execution of a sequence of commands $c_1; c_2$ first iterates the reduction of $c_1$ until
the final reduction to skip. Then, $c_2$ is reduced. The execution of a while loop
unfolds the loop when its body is executed at least once. So, this rule generates
a sequence of commands that will be further reduced.

The evaluation of expressions always terminates and the big-step semantics
of expressions observe these terminating behaviors. Contrary to big-step
semantics, small-step semantics observe in a similar and convenient way terminating
executions of commands together with diverging executions. The reflexive
transitive closure $\rightarrow^*$ of this step relation is used to chain the finite transition
sequences. In a similar way, $\rightarrow^\infty$ is used to chain infinite execution steps. Given
initial and final stores $\sigma_i$ and $\sigma_f$, the termination of a command c is defined as
$\mathrm{terminates}(\sigma_i, c, \sigma_f) = (c, \sigma_i) \rightarrow^* (\mathtt{skip}, \sigma_f)$: c terminates when it is reduced to a
skip command. Given an initial store $\sigma_i$, the diverging execution of a command c
is defined as $\mathrm{diverges}(\sigma_i, c) = (c, \sigma_i) \rightarrow^\infty$: all transition sequences starting from
$\sigma_i$ are infinite.

Moreover, the semantics observe a third kind of behavior, going-wrong behaviors
(or abnormal termination), that happen for instance because of a division
by zero. Given a command c and a store $\sigma$, this behavior is defined as
$\mathrm{goeswrong}(\sigma, c) = \exists c', \exists \sigma'.\ (c, \sigma) \rightarrow^* (c', \sigma') \wedge (c', \sigma') \not\rightarrow \wedge\ c' \neq \mathtt{skip}$: after a finite
number of execution steps to $(c', \sigma')$, this state cannot reduce (written $\not\rightarrow$) and
it is not a final state as c′ differs from the skip command. However, abnormal
termination is not preserved by verified compilation, as compiler optimizations
may remove instructions leading to going-wrong behaviors [24].
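
The three kinds of behaviors can be illustrated with a small executable sketch of the small-step rules of Fig. 2. This Python fragment is not part of the Coq development and rests on several assumptions made only for illustration: commands are tagged tuples, expressions are Python functions from stores to values, an expression that reads an unbound variable plays the role of a stuck evaluation (IMP as defined here has no division), and a hypothetical `fuel` bound stands in for divergence, which cannot be decided by running the program.

```python
SKIP = ("skip",)

def step(c, store):
    """One reduction (c, store) -> (c', store'), or None when no rule applies."""
    tag = c[0]
    try:
        if tag == "assign":
            _, x, a = c
            return SKIP, {**store, x: a(store)}
        if tag == "if":
            _, b, c1, c2 = c
            return (c1 if b(store) else c2), store
        if tag == "while":
            _, b, body = c
            # WHILE LOOP builds the new command body; while (b) body.
            return (("seq", body, c) if b(store) else SKIP), store
        if tag == "seq":
            _, c1, c2 = c
            if c1 == SKIP:               # SEQUENCE DONE
                return c2, store
            nxt = step(c1, store)        # SEQUENCE: reduce the left part
            if nxt is None:
                return None
            c1_next, store_next = nxt
            return ("seq", c1_next, c2), store_next
    except KeyError:
        return None                      # stuck expression: unbound variable
    return None                          # skip alone does not reduce

def classify(c, store, fuel=10_000):
    """Follow the step relation for at most `fuel` steps and name the behavior."""
    for _ in range(fuel):
        nxt = step(c, store)
        if nxt is None:
            kind = "terminates" if c == SKIP else "goes wrong"
            return kind, store
        c, store = nxt
    return "no verdict within fuel", store

count_to_three = ("seq",
                  ("assign", "x", lambda s: 0),
                  ("while", lambda s: s["x"] < 3,
                   ("assign", "x", lambda s: s["x"] + 1)))
spin = ("while", lambda s: True, SKIP)
stuck = ("assign", "y", lambda s: s["undefined"])
```

Here `classify(count_to_three, {})` reaches `skip`, `classify(stuck, {})` stops in a state that is not `skip` and cannot reduce, and `classify(spin, {})` exhausts its fuel, which is the most a finite run can report about a diverging execution.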


3.2 Reasoning on Operational Semantics: Simulation Diagrams 


From a proof point of view, with big-step semantics, the proof follows naturally 
the structure of programs and is conveniently conducted by induction on deriva- 
tions of big-step executions. With small-step semantics, the standard proof tech- 
nique is to rely on simulation diagrams between semantic states and involving 
invariants defining matching states. Proving a simulation requires reasoning by 
case analysis on each possible step. An interesting property of simulations is that 
they are compositional: they are chained together to describe complete program 
executions. Thus, the proof of correctness of a compiler pass mainly amounts to 
the proof of a simulation, and the tricky part often consists in finding the right 
invariants to preserve. 

The choice between a big-step and a small-step style simply on the basis of 
the adequacy to describe semantic features sometimes comes at the expense of 
the choice of the proof technique. As an example, choosing a small-step style to 
[Diagram: given $S_1 \sim S_1'$ and a step $S_1 \rightarrow S_2$, either $S_1'$ performs transitions to some $S_2'$ with $S_2 \sim S_2'$, or $S_2 \sim S_1'$ with $m(S_2) < m(S_1)$.]

Fig. 3: Forward-simulation diagram with measure. Black lines are hypotheses,
red lines are conclusions.


represent in a convenient way diverging executions of IMP prevents the use of 
standard simulations. Indeed, these simulations also represent the troublesome 
situation where infinitely many consecutive steps in the source program are sim- 
ulated by no step at all in the target program. Such situations denote incorrect 
program transformations, since some diverging behaviors are simulated by some 
terminating behaviors. In order to handle diverging execution steps and rule out 
this infinite stuttering problem, a common solution is to strengthen the invari- 
ant of the simulation with the definition of a well-founded measure (over the 
states of the source language) that for instance strictly decreases in cases where 
stuttering could occur. 

An example of a simulation diagram is the forward simulation diagram shown 
in Fig. 3 and expressed in the following theorem. Given a program $P_1$ and its
transformed program $P_2$, each transition step in $P_1$ (from semantic state $S_1$ to
semantic state $S_2$) must correspond to transitions in $P_2$ (from semantic state $S_1'$
to semantic state $S_2'$) and preserve as an invariant a relation $\sim$ between semantic
states of $P_1$ and $P_2$. The measure $m(\cdot)$ is defined over the states of $P_1$ and strictly
decreases in cases where stuttering could occur. The diagram ensures that if the
source program diverges, it must perform infinitely many non-stuttering steps, 
so the compiled code executes infinitely many transitions. 
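
One possible formal reading of this diagram is given below. It is a sketch under the assumption that non-stuttering cases require at least one transition of $P_2$ (written $\rightarrow^{+}$) and that stuttering cases leave the state of $P_2$ unchanged; the exact statement in the mechanized development may differ.

```latex
\forall S_1\, S_2\, S_1',\quad
S_1 \rightarrow S_2 \;\wedge\; S_1 \sim S_1'
\;\Longrightarrow\;
\bigl(\exists S_2',\; S_1' \rightarrow^{+} S_2' \;\wedge\; S_2 \sim S_2'\bigr)
\;\vee\;
\bigl(m(S_2) < m(S_1) \;\wedge\; S_2 \sim S_1'\bigr)
```

Because $m$ is well-founded, the second disjunct cannot be chosen infinitely many times in a row, which is how infinite stuttering is ruled out.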


4 Continuation-based Small-step Semantics for IMP 


Proving simulation diagrams is a general and convenient technique to reason on 
small-step semantics. This section explains how the simulation diagram defined 
in Section 3 can be used to reason on a toy imperative language extended with 
statements. Semantics describe the dynamics of programs, in contrast to compiler
passes, which are statically defined, for any source program. A simulation
relates the two, by expressing that target execution steps must correspond to 
source execution steps. One issue with standard small-step semantics is that they 
describe intermediate steps involving new commands that are subcommands of 
the source program (e.g., the last rule of Fig. 2).

A consequence of this spontaneous generation of commands is that the rea- 
soning required to prove a simulation becomes difficult and complicates the def- 
inition of the anti-stuttering measure. This section first defines an alternative 
small-step semantics for IMP that is better adapted to mechanized reasoning. 
Then, it shows that it is equivalent to the first small-step semantics. 
4.1 Semantic Rules


The solution adopted in CompCert is to define an original small-step style based 
on continuations, where the new semantic states become triples, as the command 
to be executed is explicitly decomposed into a sub-command c under focus, where 
computation takes place, and a context k that describes the position of the sub- 
command in the whole command; or, equivalently, a continuation that describes 
the parts of the whole command that remain to execute once the sub-command 
terminates. More precisely, semantic states take the shape $(c, k, \sigma)$,
and the semantic judgment becomes $(c, k, \sigma) \rightsquigarrow (c', k', \sigma')$. Continuations k are
of three kinds, defined in Fig. 4. 


— The continuation stop means that nothing remains to be done once the sub- 
command terminates. In other words, the sub-command under focus is the 
whole command. This happens either at the beginning or at the end of a 
program execution. 

— A continuation c;k means that when the sub-command terminates, we will 
then execute the command c, then continue as described by k. 

— A continuation $\circlearrowleft(b, c, k)$ means that when the sub-command c terminates,
we will then execute the loop while (b) c. When this loop terminates, we
will continue as described by k.


Dealing with continuations requires adding new semantic rules to define the 
execution of commands. The evaluation of expressions remains unchanged. In 
the end, there are three kinds of semantic rules (see Fig. 4): 


— Computation rules evaluate arithmetic and boolean expressions, and modify 
the triple accordingly. They are close to the rules of the previous semantics. 

— Focusing rules describe how to replace the sub-command by a sub-sub- 
command that must be executed first, enriching the continuation accord- 
ingly. 

— Resumption rules describe how to extract a continuation in order to execute 
the next sub-command. More precisely, when the sub-command under focus 
is skip, and therefore has terminated, resumption rules examine the head of 
the continuation to find the next sub-command to focus on. 


The semantics of IMP defines two focusing rules, one for sequences and one for
loops. Focusing on a sequence means executing its left part, while pushing the
right part onto the current continuation. Focusing on a loop whose condition
holds means executing its body, while pushing the loop onto the current
continuation. The semantics also defines two resumption rules. The resumption
rule for a sequence is triggered when its left part is reduced to the skip
command; it then steps to the right part of the sequence. The resumption rule
for a loop is triggered when the loop body is reduced to skip; it steps back to
the loop itself, which re-evaluates its condition before the next iteration.
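As an illustration, the rules of Fig. 4 can be transcribed into a small Python interpreter. This is only a sketch and not the Coq development: the tuple encodings, the names step and run, the fuel bound, and the modeling of expressions as Python functions over stores are all assumptions made here for the sake of the example.

```python
# Commands:      ("skip",), ("assign", x, a), ("seq", c1, c2),
#                ("if", b, c1, c2), ("while", b, c)
# Continuations: ("stop",), ("kseq", c, k), ("kwhile", b, c, k)
# Assumption: expressions a and b are Python functions from stores to values.

SKIP = ("skip",)
STOP = ("stop",)

def step(c, k, s):
    """One transition (c, k, s) ~> (c', k', s'), or None if no rule applies."""
    tag = c[0]
    if tag == "assign":                  # ASSIGN (computation)
        _, x, a = c
        return SKIP, k, {**s, x: a(s)}
    if tag == "seq":                     # SEQUENCE (focusing)
        _, c1, c2 = c
        return c1, ("kseq", c2, k), s
    if tag == "if":                      # IF TRUE / IF FALSE (computation)
        _, b, c1, c2 = c
        return (c1 if b(s) else c2), k, s
    if tag == "while":                   # WHILE LOOP / WHILE DONE
        _, b, body = c
        if b(s):
            return body, ("kwhile", b, body, k), s
        return SKIP, k, s
    if tag == "skip" and k[0] == "kseq":     # SKIP SEQUENCE (resumption)
        _, c2, k2 = k
        return c2, k2, s
    if tag == "skip" and k[0] == "kwhile":   # SKIP WHILE (resumption)
        _, b, body, k2 = k
        return ("while", b, body), k2, s
    return None                          # (skip, stop, s): final state

def run(c, s, fuel=10_000):
    """Iterate step from (c, stop, s). Returns the final store, or None when
    the fuel is exhausted (the execution may then be diverging)."""
    state = (c, STOP, s)
    for _ in range(fuel):
        nxt = step(*state)
        if nxt is None:
            return state[2]
        state = nxt
    return None
```

For instance, running the program x := 0; while (x < 3) x := x + 1 from the empty store with run ends in a store mapping x to 3. The fuel bound is only a device to keep the sketch total; it cannot decide divergence.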

Thanks to continuations, semantic rules become genuine reduction rules. For 
instance, an if command is now rewritten into a sub-command, namely one of 
its branches. Moreover, as in the previous small-step semantics, termination and 
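divergence are defined using transition sequences. Initial semantic states are
of the shape $(c, \mathsf{stop}, \sigma_i)$ and final states are of the shape
$(\mathsf{skip}, \mathsf{stop}, \sigma_f)$. Given initial and final stores
$\sigma_i$ and $\sigma_f$, the termination of a command c is defined as

$$\mathsf{kterminates}(\sigma_i, c, \sigma_f) \triangleq (c, \mathsf{stop}, \sigma_i) \leadsto^{*} (\mathsf{skip}, \mathsf{stop}, \sigma_f).$$

Given an initial store $\sigma_i$, the diverging execution of c is defined as

$$\mathsf{kdiverges}(\sigma_i, c) \triangleq (c, \mathsf{stop}, \sigma_i) \leadsto^{\infty}.$$


Continuations:
$$k ::= \mathsf{stop} \mid c; k \mid \circlearrowleft(b, c, k) \qquad \text{(stop, sequence, while)}$$

ASSIGN (COMPUTATION)
$$\frac{\sigma \vdash a \Rightarrow v}{(x := a, k, \sigma) \leadsto (\mathsf{skip}, k, \sigma[x \mapsto v])}$$

SEQUENCE (FOCUSING)
$$((c_1; c_2), k, \sigma) \leadsto (c_1, c_2; k, \sigma)$$

IF TRUE (COMPUTATION)
$$\frac{\sigma \vdash b \Rightarrow \mathsf{true}}{(\mathsf{if}\ (b)\ c_1\ \mathsf{else}\ c_2, k, \sigma) \leadsto (c_1, k, \sigma)}$$

IF FALSE (COMPUTATION)
$$\frac{\sigma \vdash b \Rightarrow \mathsf{false}}{(\mathsf{if}\ (b)\ c_1\ \mathsf{else}\ c_2, k, \sigma) \leadsto (c_2, k, \sigma)}$$

WHILE DONE (COMPUTATION)
$$\frac{\sigma \vdash b \Rightarrow \mathsf{false}}{(\mathsf{while}\ (b)\ c, k, \sigma) \leadsto (\mathsf{skip}, k, \sigma)}$$

WHILE LOOP (COMPUTATION + FOCUSING)
$$\frac{\sigma \vdash b \Rightarrow \mathsf{true}}{(\mathsf{while}\ (b)\ c, k, \sigma) \leadsto (c, \circlearrowleft(b, c, k), \sigma)}$$

SKIP SEQUENCE (RESUMPTION)
$$(\mathsf{skip}, c; k, \sigma) \leadsto (c, k, \sigma)$$

SKIP WHILE (RESUMPTION)
$$(\mathsf{skip}, \circlearrowleft(b, c, k), \sigma) \leadsto (\mathsf{while}\ (b)\ c, k, \sigma)$$


Fig. 4: Continuation-based small-step semantics for IMP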


4.2 Equivalence between the Two small-step Semantics 


The equivalence between the two small-step semantics states that they agree 
on which commands terminate and which commands diverge. In other words, it 
amounts to the two following properties. 


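Theorem 3 (Equivalence of terminating behaviors).
$\forall c, \sigma_i, \sigma_f.\ \mathsf{terminates}(\sigma_i, c, \sigma_f) \Leftrightarrow \mathsf{kterminates}(\sigma_i, c, \sigma_f)$.


Theorem 4 (Equivalence of diverging behaviors).
$\forall c, \sigma_i.\ \mathsf{diverges}(\sigma_i, c) \Leftrightarrow \mathsf{kdiverges}(\sigma_i, c)$.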


We use a simulation diagram to prove each direction of each theorem. More
precisely, we only have to define the matching invariant $\approx$ between
semantic states and the anti-stuttering measure on source states. Conducting
these proofs is yet another opportunity to validate these semantics.

As an example, we show that every transition of the continuation semantics is
simulated by zero, one or several reduction steps. Given a semantic state
$(c, k, \sigma)$, the measure is defined by a recursive function that counts
the nesting of sequence constructs in c. The invariant
$(c, k, \sigma) \approx (c', \sigma')$ is defined in Fig. 5.
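BUILD STOP
$$(\mathsf{stop}, c) \triangleright c$$

BUILD SEQ
$$\frac{(k_1, (c; c_1)) \triangleright c'}{((c_1; k_1), c) \triangleright c'}$$

BUILD LOOP
$$\frac{(k_1, (c; \mathsf{while}\ (b)\ c_1)) \triangleright c'}{(\circlearrowleft(b, c_1, k_1), c) \triangleright c'}$$

MATCHING INVARIANT
$$\frac{(k, c) \triangleright c'}{(c, k, \sigma) \approx (c', \sigma)}$$


Fig. 5: Equivalence between the two semantics: matching invariant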


The command c' is computed from the command c by the $\triangleright$ relation
of Fig. 5, which takes the sub-command c and the continuation k and rebuilds
the whole command. This is achieved by inserting c to the left of the nested
sequence constructors described by k. For instance, the second rule builds a
sequence of commands from the left command of a sequence and the sequence
continuation related to it. The proof of the simulation proceeds by structural
induction on continuations.
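The rebuilding of the whole command from a focused sub-command and its continuation can be sketched as a short Python function. The tuple encoding of commands and continuations and the name rebuild are assumptions of this sketch; conditions and expressions are treated as opaque values.

```python
# Commands:      ("skip",), ("seq", c1, c2), ("while", b, c), ...
# Continuations: ("stop",), ("kseq", c, k), ("kwhile", b, c, k)

def rebuild(k, c):
    """Compute the command c' obtained by plugging the focused sub-command c
    into the context described by the continuation k."""
    while k[0] != "stop":                 # BUILD STOP ends the traversal
        if k[0] == "kseq":                # BUILD SEQ: c1 is still pending
            _, c1, k = k
            c = ("seq", c, c1)
        else:                             # BUILD LOOP: while (b) c1 is pending
            _, b, c1, k = k
            c = ("seq", c, ("while", b, c1))
    return c

def matches(kstate, rstate):
    """Matching invariant: (c, k, s) is related to (c', s') when the command
    rebuilt from k and c is c', and the stores agree."""
    c, k, s = kstate
    c2, s2 = rstate
    return rebuild(k, c) == c2 and s == s2
```

With this encoding, a focusing step on a sequence leaves the rebuilt command unchanged: rebuild(("kseq", c2, k), c1) equals rebuild(k, ("seq", c1, c2)). Such steps correspond to zero steps of the other semantics, which is why an anti-stuttering measure is needed.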


5 Clight Semantics 


Simulation-based proof techniques scale to realistic languages such as C, and
continuation-based semantics are the preferred style to facilitate compiler
correctness proofs, as shown by their use in the CompCert compiler. There are
two C-like languages in CompCert: CompCertC, the source language of the
compiler, and Clight, a language of choice for reasoning about C programs. This
section introduces some background on CompCert generic semantics. Then, it
defines the Clight semantics.


5.1 From IMP to CompCert 


In order to model the execution of programs written in realistic languages such 
as C, the semantic judgments introduced in Section 4.1 need to be extended 
in three directions. First, C programs are composed of two kinds of functions,
depending on whether they are defined in the program (internal) or not
(external functions, which are only declared with a name and a signature). To
ensure some guarantees on external functions, the semantics observes traces of
input/output operations performed during execution. These traces belong to
program behaviors. Second,
because of pointer arithmetic, variables need to be generalized to left values, and 
the store becomes a memory model storing different kinds of values, with different 
permissions to prevent memory overflows. Third, because of the presence of 
global, local and temporary variables and functions, semantic states are more 
involved. This section gives the background to understand these three extensions 
that are explained in more detail in [9, 24, 26]. 




Instrumenting the semantics to collect traces of observables. Traces 
of input/output operations (e.g., memory accesses to global volatile variables 
used by hardware devices) are part of the observed behavior. The correctness 
theorem is strengthened to show preservation of these observable effects (that
cannot modify memory), and it becomes: if the source program terminates
(resp. diverges) and performs observable effects t, then the generated program
terminates (resp. diverges) and performs the same effects t, and has no other
behavior. Semantic judgments $S \to S'$ become $S \xrightarrow{t} S'$, where the
trace t is a (possibly infinite) list of events. An execution step
$S \xrightarrow{\epsilon} S'$ means that no event is triggered during this step.


Memory model. The memory model of CompCert is shared by all the lan- 
guages of the compiler. It provides an abstract view of memory refined into a 
concrete memory layout. The memory is a collection of disjoint blocks identified 
by memory addresses, and with fixed lower and upper bounds. Blocks store
values (i.e., byte-sized quantities) that can be either machine integers
(stored on 32 or 64 bits), pointers, floating-point numbers, or undef. A pointer
(or a memory location) is a pair $(\ell, \delta)$ made of a block identifier
$\ell$ and an integer offset $\delta$ within that block. The special undef value
is also used to denote arbitrary bit patterns, such as the value of
uninitialized variables.

Basic memory operations are load, store, alloc, and free operations. Among 
the properties of memory operations are good variables properties, that ensure 
memory safety (e.g., no out-of-bound array access) in terminating and diverging 
executions of programs. Moreover, memory operations are preserved by generic 
memory transformations called extensions and injections. They preserve the 
properties of memory operations. Last, in the C semantics of CompCert, each 
variable allocation creates a new block, and the number of blocks decreases dur- 
ing compilation. 
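To make the block-based view concrete, here is a deliberately simplified Python sketch of a memory made of disjoint blocks with fixed bounds. It is not the CompCert memory model: it ignores memory chunks, permissions and value encodings, and the class name Memory, the constant UNDEF and the method signatures are assumptions chosen for this illustration.

```python
UNDEF = "undef"   # stands for the special undef value

class Memory:
    """Toy memory: each block has bounds [lo, hi) and its own contents."""

    def __init__(self):
        self.blocks = {}       # block id -> (lo, hi, {offset: value})
        self.next_block = 0    # block identifiers are never reused

    def alloc(self, lo, hi):
        """Create a fresh block with bounds [lo, hi) and return its identifier."""
        b = self.next_block
        self.next_block += 1
        self.blocks[b] = (lo, hi, {})
        return b

    def free(self, b):
        """Release block b; fails (False) if b is not a live block."""
        if b not in self.blocks:
            return False
        del self.blocks[b]
        return True

    def _valid(self, b, ofs):
        blk = self.blocks.get(b)
        return blk is not None and blk[0] <= ofs < blk[1]

    def store(self, b, ofs, v):
        """Write v at location (b, ofs); fails (False) outside the bounds."""
        if not self._valid(b, ofs):
            return False
        self.blocks[b][2][ofs] = v
        return True

    def load(self, b, ofs):
        """Read location (b, ofs); None models a failing access."""
        if not self._valid(b, ofs):
            return None
        return self.blocks[b][2].get(ofs, UNDEF)
```

In this sketch, a load that follows a store at the same location returns the stored value, a load from a freshly allocated block returns UNDEF, and any access outside the bounds of a live block fails. These mimic, in a very reduced form, the kind of properties the text refers to as good variable properties.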


Semantic states. Three environments are used in the semantic judgments for
Clight, in addition to the memory store.


— A global environment G maps global variables to memory blocks, and function
pointers to their definitions. It does not change during evaluation and
execution.

— A local environment $\sigma$ maps local variables to pairs made of a memory
block and a type.

— A temporary environment $\sigma_l$ maps local temporaries (namely a special
class of local variables that do not reside in memory and whose address cannot
be taken) to values.


Semantic states all carry a memory store M, mapping addresses to values, 
and a continuation k materializing the call stack. These states are of three kinds: 


— regular states $S(f, c, k, \sigma, \sigma_l, M)$, that are execution points
within an internal function f at statement c,

— call states $C(Fd, v^*, k, M)$, that are reached each time a function defined
by Fd is called; the state carries the values $v^*$ of the parameters passed by
the caller,

— return states $R(v, k, M)$, that go back from a callee to its caller with
resulting value v.


Statements:
  c ::= skip                                      empty statement
      | $a_1^{\tau_1} = a_2^{\tau_2}$             assignment to a left value
      | $id \leftarrow a^{\tau}$                  assignment to a temporary variable
      | $id^? = a_f^{\tau_f}((a^{\tau})^*)$       function call
      | $id^? = ef^{\tau_{fn}}((a^{\tau})^*)$     builtin invocation
      | $c_1; c_2$                                sequence
      | if ($a^{\tau}$) $c_1$ else $c_2$          conditional
      | switch ($a^{\tau}$) ls                    multi-way test and branch
      | loop ($c_1$) $c_2$                        infinite loop
      | break                                     exit from the current loop
      | continue                                  next iteration of the current loop
      | return $(a^{\tau})^?$                     return from current function
      | lbl : c                                   labeled statement
      | goto lbl                                  jump to a label

Switch cases:
  ls ::= $\epsilon$ | $(lbl^? : c) :: ls$


Fig. 6: Clight syntax


5.2 Clight Syntax 


The syntax of Clight is defined in Fig. 6. Clight is a simplified version of the 
CompCertC source language of CompCert, where expressions are pure, and
assignments and function calls are commands instead of expressions. Clight
expressions are annotated with their types and written $a^{\tau}$; expressions
are not detailed in this paper as they are similar to those defined in [9]. A
novelty in expressions is the bitfield access mode for members of structs or
unions.

Base statements are skip, assignments, function calls (with optional assign- 
ment of the return value to a local variable) and builtin invocations, break, 
continue and function return. Other statements describe the control flow: se- 
quences, conditionals, loops, switch and goto statements. 

An infinite loop written loop ($c_1$) $c_2$ executes $c_1$ then $c_2$
repeatedly. It is equivalent to the C loop written for ( ; ; $c_2$) $c_1$. A
continue in $c_1$ branches to $c_2$. The three C loops are derived forms; a
while loop while (e) c is defined as loop ((if (e) skip else break); c) skip,
and a for loop for ($c_1$; $a_2$; $c_3$) $c_4$ is defined as the sequence
$c_1$; loop ((if ($a_2$) skip else break); $c_4$) $c_3$, so that the increment
$c_3$ runs after the body $c_4$. A switch statement consists of an expression
and a list of cases. A case is a labeled statement $\lfloor lbl \rfloor : c$ or
the default case $\epsilon : c$.
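The encoding of C loops as derived forms can be sketched with two small Python constructors. The tuple encoding and the function names are assumptions of this sketch. It follows the C reading of for (init; cond; incr) body, in which the increment runs after the body and is the target of continue, so the body goes in the first component of loop and the increment in the second.

```python
SKIP = ("skip",)
BREAK = ("break",)

def exit_unless(e):
    """if (e) skip else break: leave the enclosing loop when e is false."""
    return ("if", e, SKIP, BREAK)

def desugar_while(e, body):
    """while (e) body  ==>  loop ((if (e) skip else break); body) skip"""
    return ("loop", ("seq", exit_unless(e), body), SKIP)

def desugar_for(init, cond, incr, body):
    """for (init; cond; incr) body
       ==>  init; loop ((if (cond) skip else break); body) incr"""
    return ("seq", init, ("loop", ("seq", exit_unless(cond), body), incr))
```

Because a continue in the first component of loop branches to the second one, a continue in the body of a desugared for loop still executes the increment, as in C.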

A program is composed of several definitions of functions, global variables 
and struct and union types. A function definition Fd is either internal(f) or 
external(ef, targs, tres, cconv). The definition of an internal function f is composed 
of a signature, local variables and a body (namely a statement, called f.body). 
The definition of an external function ef only declares its signature. 

The signature of a function f is composed of a return type called f.return, 
the types of parameters and information cconv related to calling conventions 
(e.g., the possibility to return struct for functions, or the use of old-style unpro- 
totyped functions). External functions model input/output operations; they in- 
clude system calls and compiler built-in functions (e.g., volatile reads and stores, 
memory allocation and deallocation, and copy of memory blocks). Function calls 
and built-in invocations are annotated with their signature. 


5.3 Clight Semantics 


The semantics of Clight is defined by the following semantic judgments. The
terminating (resp. diverging) execution of a whole program is defined using the
relation $\to^{*}$ (resp. $\to^{\infty}$), as in Section 3.


— The big-step evaluation
$G, \sigma, \sigma_l, M \vdash a_1^{\tau_1} \Leftarrow (\ell, \delta), b$ of an
expression $a_1^{\tau_1}$ in left-value position results in a memory location
$(\ell, \delta)$ that contains the value of $a_1^{\tau_1}$ and the bitfield
designation b, that is the access mode for members of structs or unions (either
a plain field or a bitfield).

— The big-step evaluation $G, \sigma, \sigma_l, M \vdash a^{\tau} \Rightarrow v$
of an expression $a^{\tau}$ computes its value v.

— The big-step evaluation
$G, \sigma, \sigma_l, M \vdash (a^{\tau})^* \Rightarrow v^*$ of a list of
expressions computes a list of values.

— The small-step execution $G \vdash S \xrightarrow{t} S'$ from a semantic state
S steps to state S' and emits trace t.


The semantic rules for statements are defined in Fig. 7, Fig. 8 and Fig. 9. 
The rules of Fig. 7 and Fig. 8 step within the currently-executing function and 
do not trigger any external event, hence the empty trace $\epsilon$ in the rules. Fig. 7 
defines the continuations for these statements and the semantics of assignments, 
sequences of statements, loops, break and continue statements. The rule for if 
statements is not shown as it is similar to the rule of Fig. 4. 

As in Fig. 4, a continuation k consists of the remainder of a command c and a
control stack that describes the context in which c occurs. The stop and
sequence (;) continuations are defined as in Fig. 4. Two continuations are
defined for loops: $\circlearrowleft(c_1, c_2, k)$ means after $c_1$ in
loop ($c_1$) $c_2$, and $\circlearrowleft\circlearrowleft(c_1, c_2, k)$ means
after $c_2$ in this loop. A continuation $\mathsf{switch}(k)$ is defined to
catch in k a break statement arising out of a switch statement. To handle a call
to a function f, we need a new form $\mathsf{call}(id^?, f, \sigma, \sigma_l, k)$
of continuation representing pending function calls in k, given the local
(resp. temporary) environment $\sigma$ (resp. $\sigma_l$) of the calling
function and the optional identifier $id^?$ where the result is stored.

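An assignment $a_1^{\tau_1} = a_2^{\tau_2}$ to a left-value $a_1^{\tau_1}$
evaluates $a_1^{\tau_1}$ to a memory location $(\ell, \delta)$, and expression
$a_2^{\tau_2}$ to value $v_2$, then casts $v_2$ into v in order to take into
account the types of both expressions. The value v is stored at this memory
location, which may fail. Last, the memory M' is returned after storing the
value v in the datum of type $\tau_1$ stored at memory location
$(\ell, \delta)$, and the statement is reduced to skip. An assignment
$id \leftarrow a^{\tau}$ to a temporary variable id evaluates $a^{\tau}$ to a
value v and updates the temporary environment accordingly.


Continuations:
$$k ::= \mathsf{stop} \mid c; k \mid \circlearrowleft(c, c, k) \mid \circlearrowleft\circlearrowleft(c, c, k) \qquad \text{(stop, sequence, loops)}$$
$$\qquad \mid \mathsf{switch}(k) \mid \mathsf{call}(id^?, f, \sigma, \sigma_l, k) \qquad \text{(switch, call)}$$

ASSIGN (COMPUTATION)
$$\frac{\begin{array}{c} G, \sigma, \sigma_l, M \vdash a_1^{\tau_1} \Leftarrow (\ell, \delta), b \qquad G, \sigma, \sigma_l, M \vdash a_2^{\tau_2} \Rightarrow v_2 \\ \mathsf{semCast}(v_2, \tau_2, \tau_1, M) = \lfloor v \rfloor \qquad \text{storing } v \text{ at } (\ell, \delta) \text{ with access mode } b \text{ in } M \text{ yields } M' \end{array}}{G \vdash S(f, (a_1^{\tau_1} = a_2^{\tau_2}), k, \sigma, \sigma_l, M) \xrightarrow{\epsilon} S(f, \mathsf{skip}, k, \sigma, \sigma_l, M')}$$

SET (COMPUTATION)
$$\frac{G, \sigma, \sigma_l, M \vdash a^{\tau} \Rightarrow v}{G \vdash S(f, (id \leftarrow a^{\tau}), k, \sigma, \sigma_l, M) \xrightarrow{\epsilon} S(f, \mathsf{skip}, k, \sigma, \sigma_l[id \mapsto v], M)}$$

SEQUENCE (FOCUSING)
$$G \vdash S(f, (c_1; c_2), k, \sigma, \sigma_l, M) \xrightarrow{\epsilon} S(f, c_1, c_2; k, \sigma, \sigma_l, M)$$

SKIP SEQUENCE (RESUMPTION)
$$G \vdash S(f, \mathsf{skip}, c; k, \sigma, \sigma_l, M) \xrightarrow{\epsilon} S(f, c, k, \sigma, \sigma_l, M)$$

CONTINUE SEQUENCE (RESUMPTION)
$$G \vdash S(f, \mathsf{continue}, c; k, \sigma, \sigma_l, M) \xrightarrow{\epsilon} S(f, \mathsf{continue}, k, \sigma, \sigma_l, M)$$

BREAK SEQUENCE (RESUMPTION)
$$G \vdash S(f, \mathsf{break}, c; k, \sigma, \sigma_l, M) \xrightarrow{\epsilon} S(f, \mathsf{break}, k, \sigma, \sigma_l, M)$$

LOOP (COMPUTATION + FOCUSING)
$$G \vdash S(f, \mathsf{loop}\ (c_1)\ c_2, k, \sigma, \sigma_l, M) \xrightarrow{\epsilon} S(f, c_1, \circlearrowleft(c_1, c_2, k), \sigma, \sigma_l, M)$$

SKIP OR CONTINUE LOOP (RESUMPTION)
$$\frac{x \in \{\mathsf{skip}; \mathsf{continue}\}}{G \vdash S(f, x, \circlearrowleft(c_1, c_2, k), \sigma, \sigma_l, M) \xrightarrow{\epsilon} S(f, c_2, \circlearrowleft\circlearrowleft(c_1, c_2, k), \sigma, \sigma_l, M)}$$

BREAK LOOP1 (RESUMPTION)
$$G \vdash S(f, \mathsf{break}, \circlearrowleft(c_1, c_2, k), \sigma, \sigma_l, M) \xrightarrow{\epsilon} S(f, \mathsf{skip}, k, \sigma, \sigma_l, M)$$

BREAK LOOP2 (RESUMPTION)
$$G \vdash S(f, \mathsf{break}, \circlearrowleft\circlearrowleft(c_1, c_2, k), \sigma, \sigma_l, M) \xrightarrow{\epsilon} S(f, \mathsf{skip}, k, \sigma, \sigma_l, M)$$

SKIP LOOP (RESUMPTION)
$$G \vdash S(f, \mathsf{skip}, \circlearrowleft\circlearrowleft(c_1, c_2, k), \sigma, \sigma_l, M) \xrightarrow{\epsilon} S(f, \mathsf{loop}\ (c_1)\ c_2, k, \sigma, \sigma_l, M)$$


Fig. 7: Clight semantics for statements (first rules)


LABEL (COMPUTATION)
$$G \vdash S(f, (lbl : c), k, \sigma, \sigma_l, M) \xrightarrow{\epsilon} S(f, c, k, \sigma, \sigma_l, M)$$

GOTO (COMPUTATION + FOCUSING)
$$\frac{\mathsf{findLabel}(lbl, f.\mathsf{body}, \mathsf{callCont}(k)) = \lfloor (c', k') \rfloor}{G \vdash S(f, (\mathsf{goto}\ lbl), k, \sigma, \sigma_l, M) \xrightarrow{\epsilon} S(f, c', k', \sigma, \sigma_l, M)}$$

SWITCH (COMPUTATION + FOCUSING)
$$\frac{G, \sigma, \sigma_l, M \vdash a^{\tau} \Rightarrow v \qquad \mathsf{semSwitchArg}(v, \tau) = \lfloor lbl \rfloor}{G \vdash S(f, (\mathsf{switch}\ (a^{\tau})\ sl), k, \sigma, \sigma_l, M) \xrightarrow{\epsilon} S(f, \mathsf{seq}(\mathsf{selectSwitch}(lbl, sl)), \mathsf{switch}(k), \sigma, \sigma_l, M)}$$

SKIP BREAK SWITCH (RESUMPTION)
$$\frac{x \in \{\mathsf{skip}; \mathsf{break}\}}{G \vdash S(f, x, \mathsf{switch}(k), \sigma, \sigma_l, M) \xrightarrow{\epsilon} S(f, \mathsf{skip}, k, \sigma, \sigma_l, M)}$$

CONTINUE SWITCH (RESUMPTION)
$$G \vdash S(f, \mathsf{continue}, \mathsf{switch}(k), \sigma, \sigma_l, M) \xrightarrow{\epsilon} S(f, \mathsf{continue}, k, \sigma, \sigma_l, M)$$


Fig. 8: Clight semantics for goto and switch statements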


The two rules for sequences are similar to the rules given in Fig. 4. The
execution of a continue statement in a loop body interrupts the current
execution of this loop body and triggers its next iteration. So, when a continue
statement is reached in $c_1$ in a loop loop ($c_1$) $c_2$, then $c_2$ is the
next statement to execute and the continuation is updated accordingly.


The execution of a break statement in a loop body terminates the execution of
the current loop. So, the statements $c_1$ and $c_2$ of the loop are popped
from the continuation stack. Moreover, when a continue or a break statement is
followed by a statement c, then c is not executed, hence it is popped from the
continuation stack. The resumption rule for loops steps to the next execution
of the loop body when the continuation is a
$\circlearrowleft\circlearrowleft$ continuation.
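The interplay between break, continue and the two loop continuations can be replayed with a small Python sketch of the control-flow rules of Fig. 7. Environments and memory are omitted, and the tuple encoding, the names step_ctrl and run_ctrl, the fuel bound and the atomic statement ("mark", name), which only records its name in a log, are hypothetical devices introduced for this illustration.

```python
SKIP, BREAK, CONTINUE, STOP = ("skip",), ("break",), ("continue",), ("stop",)

def step_ctrl(c, k, log):
    """One control-flow step on (statement, continuation), or None if stuck."""
    if c[0] == "mark":                        # hypothetical atomic statement
        log.append(c[1])
        return SKIP, k
    if c[0] == "seq":                         # SEQUENCE
        return c[1], ("kseq", c[2], k)
    if c[0] == "loop":                        # LOOP
        return c[1], ("kloop1", c[1], c[2], k)
    if k[0] == "kseq":
        if c == SKIP:                         # SKIP SEQUENCE
            return k[1], k[2]
        if c in (BREAK, CONTINUE):            # BREAK / CONTINUE SEQUENCE
            return c, k[2]
    if k[0] == "kloop1":
        _, c1, c2, k2 = k
        if c in (SKIP, CONTINUE):             # SKIP OR CONTINUE LOOP
            return c2, ("kloop2", c1, c2, k2)
        if c == BREAK:                        # BREAK LOOP1
            return SKIP, k2
    if k[0] == "kloop2":
        _, c1, c2, k2 = k
        if c == SKIP:                         # SKIP LOOP
            return ("loop", c1, c2), k2
        if c == BREAK:                        # BREAK LOOP2
            return SKIP, k2
    return None

def run_ctrl(c, fuel=1000):
    """Run from (c, stop) for at most fuel steps; return last state and log."""
    log, state = [], (c, STOP)
    for _ in range(fuel):
        nxt = step_ctrl(*state, log)
        if nxt is None:
            break
        state = nxt
    return state, log
```

For example, in the loop whose first part is mark a; (continue; mark dead) and whose second part is mark incr; break, the statement mark dead is popped without being executed, the second part runs after the continue, and the break leaves the loop.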


Fig. 8 defines the semantics of labeled, goto and switch statements. The
execution of a labeled statement lbl : c steps to the execution of c. The
execution of a goto lbl statement in a function f first pops the continuation
stack k until a call or a stop, in order to remove from k its local context
part. Then, from this continuation callCont(k) representing the control flow
from the last caller of f, findLabel computes recursively (if any) the control
flow in f from its entry point until the statement labeled lbl. A new
continuation k' that extends callCont(k) and represents this control flow is
then manufactured, and findLabel returns (if any) the pair (c', k'), where c' is
the leftmost sub-statement of f.body labeled lbl. The rule thus steps to
statement c' and continuation k', with no change in environments.
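The first stage of the goto rule, popping the local part of the continuation, can be sketched in a few lines of Python. The tuple encoding of continuations, with the enclosing continuation as last component, and the name call_cont are assumptions of this sketch; findLabel itself is not modeled here.

```python
# Continuations: ("stop",), ("kseq", c, k), ("kloop1", c1, c2, k),
#                ("kloop2", c1, c2, k), ("kswitch", k),
#                ("kcall", optid, f, env, tenv, k)

def call_cont(k):
    """Pop sequence, loop and switch continuations until a call continuation
    or stop is at the head; this removes the local context of the function."""
    while k[0] not in ("stop", "kcall"):
        k = k[-1]      # the enclosing continuation is the last component
    return k
```

The result describes only the pending callers of the current function, which is the starting point from which a continuation for the target label is rebuilt.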
The execution of a statement switch ($a^{\tau}$) sl first evaluates $a^{\tau}$
into value v, which is then cast into an unsigned integer when $\tau$ is an
integer type (and fails otherwise). The rule steps to the appropriate case of
the switch, given the value of the selector expression, and the corresponding
statements are executed (after being converted into a sequence of statements
from a labeled statement). In other words, the rule focuses on a switch case
and the continuation remembers this control flow. This rule is general enough
to model executions of unstructured switch statements such as Duff's device
[14].


The execution of a break statement in a switch case terminates the execution
of this case. In other words, the execution of a break (or skip) statement in a
switch case steps to skip and updates the continuation into k. The execution of
a continue statement in a switch case updates the continuation into k as well,
while keeping the continue statement as the current statement.


The semantic rules involving call and return states are defined in Fig. 9.
First, the rule for a call to an internal function identified by
$a_f^{\tau_f}$ evaluates $a_f^{\tau_f}$ into v and each argument $a^{\tau}$ of
the function. The value v identifies the block where the function definition Fd
is stored in the global environment G, and funct(G, v) returns this definition
if any. The rule requires that the signature of the called function matches the
signature $\tau_f$ annotating the call, namely $\tau_f = \mathsf{sigOf}(Fd)$.


The rule for a builtin invocation also evaluates the list of its arguments. A 
builtin is an external function ef and the rule applies ef to arguments v*: it 
mainly checks that the builtin is known, that ef cannot modify the memory 
state M, that v* are integers or floats and that they agree in number and types 
with the function signature (see [24]). 


The execution of a return statement frees in memory M all the blocks of the
current environment $\sigma$, and steps to a return state with the returned
value if any (or undef otherwise), and updated continuation and memory state.


A step from a call state with an internal function f steps to a regular state
to further execute the statements f.body of f. The semantics for allocation of
variables (hence the modified memory M') and binding of parameters is given by
$\mathsf{functionEntry}(f, v^*, M, \sigma, \sigma_l, M')$. Two semantics are
supported, one where parameters are local variables, reside in memory, and can
have their address taken, and the other where parameters are temporary
variables and do not reside in memory.


A step from a call state with an external function ef steps directly to a
return state (to further return to its caller) after generating the appropriate
event in the trace t. Moreover, the rule applies ef to arguments $v^*$, to
perform similar checks to those performed by the rule for builtin invocation.
Last, a step from a return state either ends the program execution (when the
call stack becomes empty) or reaches the regular state of the caller that
carries a skip statement and the returned value v stored in the temporary
environment.
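FUNCTION CALL
$$\frac{\begin{array}{c} G, \sigma, \sigma_l, M \vdash a_f^{\tau_f} \Rightarrow v \qquad G, \sigma, \sigma_l, M \vdash (a^{\tau})^* \Rightarrow v^* \\ \mathsf{funct}(G, v) = \lfloor Fd \rfloor \qquad \tau_f = \mathsf{sigOf}(Fd) \end{array}}{G \vdash S(f, id^? = a_f^{\tau_f}((a^{\tau})^*), k, \sigma, \sigma_l, M) \xrightarrow{\epsilon} C(Fd, v^*, \mathsf{call}(id^?, f, \sigma, \sigma_l, k), M)}$$

BUILTIN INVOCATION
$$\frac{G, \sigma, \sigma_l, M \vdash (a^{\tau})^* \Rightarrow v^* \qquad G \vdash ef(v^*), M \stackrel{t}{\Rightarrow} v, M'}{G \vdash S(f, id^? = ef^{\tau_{fn}}((a^{\tau})^*), k, \sigma, \sigma_l, M) \xrightarrow{t} S(f, \mathsf{skip}, k, \sigma, \sigma_l[id \mapsto v], M)}$$

RETURN 1
$$\frac{\mathsf{semCast}(v, \tau, f.\mathsf{return}, M) = \lfloor v' \rfloor \qquad \mathsf{freeAll}(M, \sigma) = \lfloor M' \rfloor}{G \vdash S(f, \mathsf{return}\ \lfloor a^{\tau} \rfloor, k, \sigma, \sigma_l, M) \xrightarrow{\epsilon} R(v', \mathsf{callCont}(k), M')}$$

RETURN 0
$$\frac{\mathsf{freeAll}(M, \sigma) = \lfloor M' \rfloor}{G \vdash S(f, \mathsf{return}\ \epsilon, k, \sigma, \sigma_l, M) \xrightarrow{\epsilon} R(\mathsf{undef}, \mathsf{callCont}(k), M')}$$

SKIP CALL
$$\frac{\mathsf{freeAll}(M, \sigma) = \lfloor M' \rfloor}{G \vdash S(f, \mathsf{skip}, k, \sigma, \sigma_l, M) \xrightarrow{\epsilon} R(\mathsf{undef}, k, M')}$$

INTERNAL FUNCTION
$$\frac{\mathsf{functionEntry}(f, v^*, M, \sigma, \sigma_l, M')}{G \vdash C(\mathsf{internal}(f), v^*, k, M) \xrightarrow{\epsilon} S(f, f.\mathsf{body}, k, \sigma, \sigma_l, M')}$$

EXTERNAL FUNCTION
$$\frac{G \vdash ef(v^*), M \stackrel{t}{\Rightarrow} v, M'}{G \vdash C(\mathsf{external}(ef, targs, tres, cconv), v^*, k, M) \xrightarrow{t} R(v, k, M')}$$

RETURNSTATE
$$G \vdash R(v, \mathsf{call}(id^?, f, \sigma, \sigma_l, k), M) \xrightarrow{\epsilon} S(f, \mathsf{skip}, k, \sigma, \sigma_l[id^? \mapsto v], M)$$


Fig. 9: Clight semantics for functions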


6 Related Work 


The semantics of the Clight language were first mechanized using big-step seman- 
tics [9] that were targeting a smaller language and only observing terminating 
behaviors. Then, a co-inductive interpretation of big-step semantics for diverg- 
ing behaviors was defined [28]. However, this approach did not scale to conduct 
compiler correctness proofs of CompCert, contrary to the current continuation- 
based small-step semantics. Indeed, the cost for extending the correctness proof 
to diverging behaviors was relatively high (and Coq support for coinductive 
proofs is temperamental). Compared to [9], the Clight language was extended 
to model assignments of temporary variables, single infinite loops (instead of C 
loops), labeled and general goto statements and switch statements. 
Other mechanized semantics were defined for realistic languages such as Java,
the JVM [20] and JavaScript [10]. In [20], the authors define a big-step
semantics and a small-step semantics, which are proved equivalent. A two-stage
compiler from Java to a virtual machine is proved correct using the simulation
proof technique. These semantics target a simpler compiler than CompCert, only
observe terminating behaviors, and do not use continuations.

The idea of using continuations to facilitate some mechanized semantic rea- 
soning first appeared in [4], where an axiomatic semantics (a.k.a. program logics) 
was defined from an operational semantics. The considered language was Cminor, 
a lower-level language than Clight, that is the target language of the CompCert 
front-end. Thanks to continuations, the soundness proof of the axiomatic
semantics reuses the induction principles generated by Coq, thus avoiding the
need to craft error-prone induction principles. Continuation-based small-step
semantics were then used in the backend of the CompCert compiler [24].


7 Conclusion 


This paper presented some operational styles for defining mechanized semantics 
of programming languages, starting from a toy imperative language to the C lan- 
guage. Exploration on toy languages is essential, but the results do not directly 
scale to big languages. This paper details the Clight semantics of CompCert, a 
reasonable proposal that works well in the context of compiler verification and 
a choice language to reason on C programs. 

The continuation-based small-step semantics style detailed in this paper is the 
style chosen for all the languages of the CompCert compiler. It models terminat- 
ing and diverging executions of programs and facilitates the semantic reasoning 
using simulation proof techniques. 

Mechanized semantics is a need shared by many verification efforts, not just 
verified compilation. It is still a difficult task, especially for realistic programming 
languages. Better tooling for defining and maintaining mechanized semantics for 
realistic languages is needed. 
Abstract. In model-driven engineering, runtime monitoring of systems 
with complex dynamic structures is typically performed via a runtime 
model capturing a snapshot of the system state: the model is represented 
as a graph and properties of interest as graph queries which are evaluated 
over the model online. For temporal properties, history-aware runtime 
models encode a trace of timestamped snapshots, which is monitored 
via temporal graph queries. In this case, the query evaluation needs to 
consider that a trace may be incomplete, thus future changes to the 
model may affect current answers. So far there is no formal foundation for 
query-based monitoring over runtime models encoding incomplete traces. 
In this paper, we present a systematic and formal treatment of incomplete 
traces. First, we introduce a new definite semantics for a first-order 
temporal graph logic which only returns answers if no future change 
to the model will affect them. Then, we adjust the query evaluation 
semantics of a querying approach we previously presented, which is based 
on this logic, to the definite semantics of the logic. Lastly, we enable 
the approach to keep to its efficient query evaluation technique, while 
returning (the more costly) definite answers. 


1 Introduction 


Modern safety-critical systems, e.g., smart healthcare and autonomous trans- 
portation, consist of numerous interconnected technologies such as sensors, smart 
devices, and information systems [15]. These systems are human-in-the-loop and 
operate in highly dynamic environments [16]. Moreover, they are real-time, i.e., 
their safe operation depends on the timing of their actions, and missed deadlines 
for these actions may lead to hazardous situations [46]. These characteristics 
hinder complete quality assurance during the design of such systems and increase 
the uncertainty about their behavior at runtime. Consequently, their safe opera- 
tion relies on formally precise Runtime Monitoring (RM) techniques [34], which 
are capable of handling the complex underlying structure and its dynamics [13] 
as well as timing constraints when monitoring the system behavior [4]. 

As shown by recent surveys [9, 52], in model-driven engineering, RM of 
systems with complex dynamic structures is typically performed via a (structural) 
Runtime Model (RTM) [12] capturing a snapshot of the system state: the model 
is represented as a graph of interacting components and properties of interest 
as graph queries which are evaluated over the model online; query matches 
constitute monitoring issues. For efficiency, the evaluation of graph queries is 
based on methods which afford incremental and change-driven evaluation [54], 
i.e., triggered only when changes to the RTM are relevant to a query. 


For temporal properties, history-aware RTMs capture past changes to the 
model and their timing [11], thereby encoding a trace of timestamped snapshots. 
These RTMs are then monitored via the evaluation of temporal graph queries 
which specify the ordering and timing constraints that matches should satisfy. In 
this case, the query evaluation needs to consider that the trace encoded by the 
history-aware RTM may be incomplete, i.e., the execution may be ongoing, and 
hence future changes to the RTM may affect current query answers. So far there 
is no formal foundation for temporal-query-based RM over incomplete RTMs. 


In our previous work, we presented a querying approach for the evaluation 
of temporal graph queries over history-aware RTMs named INTEMPO [49]—see 
Section 2.3 for an overview and Fig. 1 for an illustration. INTEMPO advances the 
state-of-the-art by: enabling a formally precise answer set which pairs matches 
with their temporal validity, i.e., the set of all time points for which a match 
exists and satisfies a temporal property according to a first-order temporal graph 
logic; featuring sound methods for incremental and change-driven evaluation as 
well as the optional pruning of the RTM, i.e., the removal of temporally irrelevant 
history. Extensive experimental evaluation showed that our implementation of 
INTEMPO efficiently evaluated complex queries over considerably large models 
(approx. from 10K to 48M elements) [49]. The experimental evaluation included 
an RM application scenario, in which INTEMPO evaluated queries faster than an 
RTM-based tool and a tool from the related RM approach known as Runtime 
Verification (RV). 

However, the formal foundation of INTEMPO assumes that the RTM encodes 
a complete trace. For the RM scenario, we equipped INTEMPO with a check 
that was applied to the answer set and, based on the timing constraint of the 
property, filtered matches that could be affected by future changes to the RTM. 
In this paper, we present a formal foundation for temporal-query-based RM over 
incomplete RTMs. The foundation entails the introduction of an answer set which 
formalizes the intuition behind the check and allows approaches like INTEMPO 
to maintain their efficiency while returning formally precise answers. 


Specifically, our contributions are the following. First, we introduce a definite 
semantics for a temporal graph logic (Section 3), which only returns answers if 
they are definite, i.e., no future change to the RTM will affect them; we show 
that the definite semantics is sound. Then, we introduce a new definite answer set 
(Section 4) for the query language of INTEMPO which pairs matches with their 
definite temporal validity and invalidity. Compared to the original (non-definite) 
answer set, the definite answer relies on the time point on which a query is 
evaluated and thus requires the re-computation of the definite temporal validity 
and invalidity in each evaluation. The definite answer set is thus inefficient, i.e., 
not amenable to change-driven evaluation. However, we use this theoretical result
to show that our last contribution, the effective answer set (Section 5), which
essentially incorporates the check mentioned above, can return definite answers
while relying on the original, and thus efficient, answer set.

[Figure 1. Left panel: class diagram with Service, SHSService, DrugService,
PMonitoringService and Probe; visible attributes are cts : long, dts : long,
status : string and pID : int. Right panel: event trace, mapping of events to
modifications, RTM, operationalization, evaluation, answer set, pruning.]

Fig. 1: An excerpt of the SHS metamodel from [49] (left) and an operational
overview of the INTEMPO implementation where arrows denote input and output
(right).

The presented contributions are based on unpublished material from the 
doctoral thesis of the first author [47]. Section 2 reiterates preliminaries and 
INTEMPO, Section 6 discusses related work, and Section 7 concludes the paper. 

Running Example. As a running example, we use the Smart Healthcare
System (SHS) introduced in [49]. Fig. 1 shows an excerpt of the SHS metamodel.
An SHS is an envisioned smart medical environment [45], based on the
service-based exemplar in [55], which supports clinicians in medical treatments by
automating tasks via smart devices. In the context of an SHS, RM may be used
to verify whether treatments comply with the requirements in a guideline, which
typically contain timing constraints [17]. In the SHS, services are invoked by a
main service called SHSService to collect measurements from patient sensors,
i.e., PMonitoringService, or take medical actions via smart medical devices
such as a smart pump, i.e., DrugService. The results of service invocations are
tracked via monitoring probes (Probe) that are attached to Services. Probes 
are generated periodically or upon events in the real world. Each Probe has a 
status attribute whose value depends on the type of Service. Each Service has 
a pID attribute which identifies the patient for whom the Service is invoked. 
The MonitorableEntity is explained in Section 2.1. 

We focus on a property P that tracks time between triage and admission, as 
often done in medical guidelines [39]; in the context of an SHS, these activities are 
represented by the invocation of a sensor service and a drug service, respectively: 
“When a sensor service is invoked for a patient, there should be a drug service 
invoked for the same patient within one minute and, until then, there should 
be no other sensor service invoked for the same patient.” The specific timing 
constraint is adjusted for the purpose of presentation. Assume an RTM that 
captures that a sensor service has just been invoked for a patient, but contains no 
drug invocation yet; for monitoring P, it is important to consider that a future 
state which contains the drug service invocation may follow in time; therefore, 
the present state does not yet violate P. 


2 Preliminaries 


In this section, we summarize preliminaries and the INTEMPO query language. 
An overview of the notation used in the paper is shown in Table 2 in Section A. 
Fig. 2: Patterns for the SHS (left) and the GDN N for the query $(n_1, \neg\psi_P)$.
[Figure content not reproduced. The patterns shown are $n_1$, with vertices
s:SHSService and pm:PMonitoringService; $n_{1.1}$, which adds
pm2:PMonitoringService and the constraint {pm.pID = pm2.pID}; and $n_{1.2}$,
which adds d:DrugService and the constraint {pm.pID = d.pID}.]


2.1 Formal Representation of Models and Queries 


An RTM is typically represented as a graph, where system entities are captured by 
vertices, information about the entities by attributes, and relationships between 
entities by edges [25, 14, 24]. In this paper, for the formal representation of RTMs,
we rely on the well-known typed graphs [20], i.e., graphs typed over a type graph 
which defines types of vertices, edges, and valid structures for typed graphs. 


Definition 1 ((typed) graph, (typed) graph morphism, type graph). A
graph $G = (G^V, G^E, s^G, t^G)$ consists of a set of vertices $G^V$, a set of
edges $G^E$, a source function $s^G : G^E \to G^V$, and a target function
$t^G : G^E \to G^V$. Given two graphs $G = (G^V, G^E, s^G, t^G)$ and
$K = (K^V, K^E, s^K, t^K)$, a graph morphism $f : G \to K$ is a pair of mappings
$f^V : G^V \to K^V$, $f^E : G^E \to K^E$ such that
$f^V \circ s^G = s^K \circ f^E$ and $f^V \circ t^G = t^K \circ f^E$. A graph
morphism $f : G \to K$ is a monomorphism, denoted by $\hookrightarrow$, if $f^V$
and $f^E$ are injective. A type graph is a distinguished graph
$TG = (TG^V, TG^E, s^{TG}, t^{TG})$. A tuple $(G, type)$ consisting of a graph $G$
and a graph morphism $type : G \to TG$ is called a typed graph. Given two typed
graphs $G^T = (G, type)$ and $K^T = (K, type')$, a typed graph morphism
$f : G^T \to K^T$ is a graph morphism $f : G \to K$ such that
$type' \circ f = type$.
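
The morphism conditions of Definition 1 can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: the `Graph` class, the function names, and the edge type `invokes` are hypothetical and do not stem from the SHS metamodel or from INTEMPO. It tests whether a pair of mappings commutes with the source and target functions, and uses that test to check a typing morphism into a small type graph.

```python
from dataclasses import dataclass


@dataclass
class Graph:
    vertices: set
    edges: set
    src: dict  # edge -> source vertex (s^G)
    tgt: dict  # edge -> target vertex (t^G)


def is_morphism(g, k, f_v, f_e):
    """Check f^V o s^G = s^K o f^E and f^V o t^G = t^K o f^E."""
    total = (all(v in f_v and f_v[v] in k.vertices for v in g.vertices)
             and all(e in f_e and f_e[e] in k.edges for e in g.edges))
    if not total:
        return False
    return all(
        f_v[g.src[e]] == k.src[f_e[e]] and f_v[g.tgt[e]] == k.tgt[f_e[e]]
        for e in g.edges
    )


def is_monomorphism(g, k, f_v, f_e):
    """A morphism whose vertex and edge mappings are injective."""
    if not is_morphism(g, k, f_v, f_e):
        return False
    return (len({f_v[v] for v in g.vertices}) == len(g.vertices)
            and len({f_e[e] for e in g.edges}) == len(g.edges))


# Hypothetical type graph: one edge type "invokes" (an assumed name).
tg = Graph({"SHSService", "PMonitoringService"}, {"invokes"},
           {"invokes": "SHSService"}, {"invokes": "PMonitoringService"})
# Instance graph: s invokes pm1.
g = Graph({"s", "pm1"}, {"e1"}, {"e1": "s"}, {"e1": "pm1"})
type_v = {"s": "SHSService", "pm1": "PMonitoringService"}
type_e = {"e1": "invokes"}

typed_ok = is_morphism(g, tg, type_v, type_e)
```

A typed graph in the sense of Definition 1 is then the pair `(g, (type_v, type_e))` whenever `typed_ok` holds.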


Type graphs can be extended to support the well-known concepts of inheritance 
and multiplicities from the object-oriented paradigm [53]. Moreover, typed graphs 
can be extended by vertex and edge attributes, each associated with a data type, 
i.e., a character string, an integer, a real number, or a boolean, to obtain typed 
attributed graphs [20]. Attribute assignments assign data-type-compatible values 
to attributes, and attribute constraints, i.e., a boolean expression over attribute 
values, restrict the possible assignments. Our contributions rely on such graphs, 
defined in detail in our prior work [50]; to avoid the complication of presentation, 
here we omit these extensions from our definitions. 

The metamodel in Fig. 1 may be seen as an informal representation of the
type graph of the SHS, where only vertices have attributes. Correspondingly, the
RTM $G_7$ in Fig. 3 is an informal representation of a typed attributed graph. We
henceforth refer to typed attributed graphs simply as graphs or patterns. The
RTM $G_7$ contains assignments, which assign values to attributes, e.g.,
pm1.pID = 1. The representation of the textual statements in property P of the
running example by patterns is illustrated in Fig. 2: The invocation of a sensor
service is captured in patterns $n_1$ and $n_{1.1}$, and the invocation of a drug
service is captured in $n_{1.2}$; constraints are illustrated between braces, e.g.,
$n_{1.1}$ requires that the values for pID of pm and pm2 are equal; vertices with
the same label refer to the same vertex in the queried RTM.

Fig. 3: Snapshots as RTMs ($G_\tau$) and traces as RTM$^H$ instances ($H_{[\tau]}$).
[Figure content not reproduced. $H_{[5]}$ contains s:SHSService (cts = 2,
dts = $\infty$), pm1:PMonitoringService (pID = 1, cts = 4, dts = $\infty$), and
d1:DrugService (pID = 1, cts = 5, dts = $\infty$); $H_{[7]}$ additionally contains
pm2:PMonitoringService (pID = 2, cts = 7, dts = $\infty$) and sets dts = 7 for d1.]

We assume that the system is instrumented to generate (instantaneous) events
upon changes to its state, and identify the system execution with a possibly
infinite sequence of such events. The system has a clock whose time domain is the
set of non-negative real numbers $\mathbb{R}^+_0$, and uses the clock to timestamp
events. We refer to an element of the time domain as a time point. Intuitively,
an (execution) trace $h_\tau$ of a system with respect to an event at time point
$\tau$ is the sequence of all observed events in the execution from its beginning,
i.e., time point 0, up to and including $\tau$. For brevity, we group all changes
with the same time point in one event. However, we require that no event groups
an infinite number of changes, thereby ruling out Zeno behaviors: in the use-cases
of interest, all traces will eventually terminate and differences between
measurements cannot become infinitely small. We denote the time point at position
$i$ of $h_\tau$ by $\tau_i$, with $i \in \mathbb{N}^+$.

For a model-based representation of a trace $h_\tau$, we rely on a Runtime Model
with History (RTM$^H$) [49]. An RTM$^H$ $H$ is a distinguished RTM where the
following conditions hold. All vertices in $H$ have a distinguished creation
timestamp cts and a deletion timestamp dts to which a value is assigned;
therefore, in Fig. 1, all vertices inherit from the MonitorableEntity.³ When a
vertex is created, the time point of creation is assigned to cts and the value
$\infty$ is assigned to the dts; the dts value changes when the vertex is deleted
in the modeled system. As a vertex cannot have been deleted prior to its creation
or deleted simultaneously to its creation, the value of dts, if not $\infty$, has
to be larger than the value of cts.


3 If tracking changes to attribute values or edges in an RTM is of importance, those 
can be modeled as vertices, which is a customary modeling technique, e.g., [36]. 
An $h_\tau$ can be transformed to an RTM$^H$ $H$ based on a mapping $\mathcal{E}$
from the set of all possible events to corresponding graph modifications [48]; to
capture the period covered by $H$ in this case, we denote it by $H_{[\tau]}$. Each
trace continuation $h_{\tau'}$ that is yielded by an event at time point $\tau'$
with $\tau' > \tau$ can be similarly transformed to an $H_{[\tau']}$ by applying
the changes in the event at $\tau'$ to $H_{[\tau]}$; we refer to $H_{[\tau']}$ as a
new version of $H_{[\tau]}$. This process generates a trace of RTM$^H$s
$h^H_{\tau'}$, called an RTM$^H$-trace, which mirrors $h_{\tau'}$; we refer to
members of $h^H_{\tau'}$ as instances of the RTM$^H$. Formally, an $H_{[\tau]}$ is
a compact representation of a timed graph sequence [26], i.e., a sequence of
timestamped graphs where additions and deletions between two consecutive graphs
are represented by morphisms. As an example of an RTM$^H$, see $H_{[5]}$ in Fig. 3,
which contains all changes in events up to time point 5; $H_{[5]}$ represents the
timed graph sequence $G_2 G_4 G_5$ (left in Fig. 3; morphisms are omitted). A new
event at time point 7 which contains the deletion of d1 and the addition of pm2 is
transformed into $H_{[7]}$; this RTM$^H$ represents the sequence
$G_2 G_4 G_5 G_7$. If $\tau$ in $h_\tau$, $h^H_\tau$, or $H_{[\tau]}$ is
irrelevant, we omit it.
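
The construction of an RTM$^H$ from a trace can be sketched as follows. The code is a minimal illustration, not the INTEMPO implementation: `Vertex`, `apply_event`, and `snapshot` are hypothetical names, attributes other than cts and dts are omitted, and the event format (lists of created and deleted vertices per time point) is an assumption standing in for the mapping from events to graph modifications. The events replay the example of Fig. 3.

```python
import math
from dataclasses import dataclass

INF = math.inf


@dataclass
class Vertex:
    name: str
    vtype: str
    cts: float          # creation timestamp
    dts: float = INF    # deletion timestamp, infinity while the vertex exists


def apply_event(rtmh, time_point, created=(), deleted=()):
    """Apply all changes grouped in one event at time_point to the RTM^H."""
    for name, vtype in created:
        rtmh[name] = Vertex(name, vtype, cts=time_point)
    for name in deleted:
        vertex = rtmh[name]
        if time_point <= vertex.cts:
            raise ValueError("dts has to be larger than cts")
        vertex.dts = time_point  # the vertex is kept; only its dts changes
    return rtmh


def snapshot(rtmh, t):
    """Names of the vertices that exist at time point t (cts <= t < dts)."""
    return {v.name for v in rtmh.values() if v.cts <= t < v.dts}


h = {}
apply_event(h, 2, created=[("s", "SHSService")])
apply_event(h, 4, created=[("pm1", "PMonitoringService")])
apply_event(h, 5, created=[("d1", "DrugService")])        # h now mirrors H_[5]
apply_event(h, 7, created=[("pm2", "PMonitoringService")],
            deleted=["d1"])                                # h now mirrors H_[7]
```

Because deleted vertices are retained with a finite dts, `snapshot(h, 5)` still recovers the graph $G_5$ after the event at time point 7 has been applied.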


2.2 Metric Temporal Graph Logic 


For the specification and analysis of temporal properties in temporal queries,
INTEMPO relies on the Metric Temporal Graph Logic (MTGL) [50, 26]. MTGL
builds on Nested Graph Conditions (NGCs) [27] and Metric Temporal Logic
(MTL) [35] to enable the formulation of Metric Temporal Graph Conditions
(MTGCs). The language of NGCs can formulate requirements that are as expressive
as first-order logic on graphs [18], as shown in [27, 44], and as such constitutes
a natural formal foundation for pattern-based queries. Like NGCs, MTGCs support
bindings, i.e., morphisms between patterns which bind elements in outer
conditions to inner (nested) conditions, and are therefore able to track the
evolution of a given binding in a sequence of graphs separately from other
bindings.

In the following definition of MTGL, we focus on a subset of MTGL operators
which contains the metric, i.e., interval-based, temporal operators until ($U_I$,
with $I$ an interval in $\mathbb{R}^+_0$) and its dual since ($S_I$) from MTL. The
existential quantifier features a binding between the patterns $n$ and $\hat{n}$.


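Definition 2 (metric temporal graph conditions). Let $n, \hat{n}$ be patterns and
$f : n \hookrightarrow \hat{n}$ a binding. Moreover, let $I$ be an interval in
$\mathbb{R}^+_0$. Then $\psi_n$ is a Metric Temporal Graph Condition (MTGC) over
$n$ defined as follows.


$$\psi_n ::= true \mid \neg\psi_n \mid \psi_n \wedge \psi_n \mid \exists(f : n \hookrightarrow \hat{n}, \psi_{\hat{n}}) \mid \psi_n\, U_I\, \psi_n \mid \psi_n\, S_I\, \psi_n$$


In the remainder, we abbreviate $\exists(f, true)$ by $\exists f$ and, when the
domain of $f$ is clear from the context,
$\exists(f : n \hookrightarrow \hat{n}, \psi_{\hat{n}})$ by
$\exists(\hat{n}, \psi_{\hat{n}})$. Other abbreviations, e.g., disjunction
($\vee$) and eventually ($\Diamond_I$), can be defined as usual.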
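
The grammar of Definition 2 maps naturally onto an abstract syntax tree. The sketch below is an assumption-laden illustration: all class and function names are hypothetical, patterns are referred to by name only (bindings are left implicit, as in the abbreviated form), and intervals are plain pairs of end-points. The derived operators use the usual definitions, i.e., eventually as `true U_I psi` and disjunction via De Morgan.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class TrueC:
    """The condition `true`."""


@dataclass(frozen=True)
class Not:
    arg: "MTGC"


@dataclass(frozen=True)
class And:
    left: "MTGC"
    right: "MTGC"


@dataclass(frozen=True)
class Exists:
    pattern: str   # name of the nested pattern; the binding is implicit
    arg: "MTGC"


@dataclass(frozen=True)
class Until:
    left: "MTGC"
    interval: Tuple[float, float]
    right: "MTGC"


@dataclass(frozen=True)
class Since:
    left: "MTGC"
    interval: Tuple[float, float]
    right: "MTGC"


MTGC = Union[TrueC, Not, And, Exists, Until, Since]


def eventually(interval, arg):
    """Derived operator: eventually_I arg := true U_I arg."""
    return Until(TrueC(), interval, arg)


def disjunction(a, b):
    """Derived operator: a or b := not (not a and not b)."""
    return Not(And(Not(a), Not(b)))


# Property P of the running example in abbreviated form.
psi_P = Until(Not(Exists("n1.1", TrueC())), (0, 60), Exists("n1.2", TrueC()))
```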

Based on the patterns in Fig. 2, property P from the running example can be
reformulated into “given a binding for $n_1$ at a time point $\tau$, at least one
binding for $n_{1.2}$ is found at some time point $\tau' \in [\tau, \tau + 60]$,
i.e., at most 60 seconds later; in addition, at each time point
$\tau'' \in [\tau, \tau')$ in between, no binding for $n_{1.1}$ is present.” In
MTGL, this property is captured by the MTGC
$\psi_P := \neg\exists(n_1 \hookrightarrow n_{1.1}, true)\, U_{[0,60]}\, \exists(n_1 \hookrightarrow n_{1.2}, true)$,
or, abbreviated, $\neg\exists n_{1.1}\, U_{[0,60]}\, \exists n_{1.2}$. The system is
assumed to track time in seconds; vertices s and pm from $n_1$ are bound in the
patterns $n_{1.1}$ and $n_{1.2}$, i.e., all patterns refer to the same s and pm.

MTGL reasons over (finite) timed graph sequences. However, MTGCs can also
be equivalently checked over a graph with history [26], which here corresponds to
an RTM$^H$. In the following, we define the semantics of the satisfaction relation
of MTGL based on an RTM$^H$.


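Definition 3 (satisfaction of metric temporal graph conditions over an
RTM$^H$). Let $H$ be an RTM$^H$, $n$ a pattern, and $m : n \hookrightarrow H$ a
binding. Moreover, let $\tau$ be a time point in $\mathbb{R}^+_0$ and $\psi$ be an
MTGC over $n$. Then $m$ in $H$ satisfies $\psi$ at $\tau$, written
$(H, m, \tau) \models \psi$, if
$\max_{e \in E} e.cts \le \tau < \min_{e \in E} e.dts$, with $E$ the vertices of
$m$, and one of the following cases applies.


- $\psi = true$.

- $\psi = \neg\chi$ and $(H, m, \tau) \not\models \chi$.

- $\psi = \chi \wedge \omega$, $(H, m, \tau) \models \chi$, and
  $(H, m, \tau) \models \omega$.

- $\psi = \exists(f : n \hookrightarrow \hat{n}, \chi)$ and there exists
  $\hat{m} : \hat{n} \hookrightarrow H$ such that $\hat{m} \circ f = m$ and
  $(H, \hat{m}, \tau) \models \chi$.

- $\psi = \chi\, U_I\, \omega$ and there exists $\tau'$ with $\tau' - \tau \in I$
  such that $(H, m, \tau') \models \omega$ and for all
  $\tau'' \in [\tau, \tau')$, $(H, m, \tau'') \models \chi$.

- $\psi = \chi\, S_I\, \omega$ and there exists $\tau'$ with $\tau - \tau' \in I$
  such that $(H, m, \tau') \models \omega$ and for all
  $\tau'' \in (\tau', \tau]$, $(H, m, \tau'') \models \chi$.


Intuitively, a binding $m$ for $n$ in the RTM$^H$ $H$ satisfies the MTGC
$\exists(f : n \hookrightarrow \hat{n}, \chi)$ at time point $\tau$ if (i) all
elements of $m$ are already created but not yet deleted at $\tau$, and (ii) there
exists a binding $\hat{m}$ for $\hat{n}$ in $H$ such that $\hat{m}$ is compatible
with $m$, i.e., respects the binding between the two patterns captured in
$n \hookrightarrow \hat{n}$, and $\hat{m}$ satisfies the MTGC $\chi$ at $\tau$. The
intuition behind true, negation, conjunction, until, and since is the usual one.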


2.3 INTEMPO: Query Language and Overview of Operation 


INTEMPO introduces a query language, henceforth referred to as $\mathcal{L}$,
which has two distinguishing features: it enables the formulation of ordering and
temporal constraints in MTGL, i.e., as an MTGC, thereby enabling formal precision
in checking whether matches satisfy those constraints; it computes the period for
which a match satisfies an MTGC, thereby enabling practical query evaluations,
as the query does not have to be evaluated for each time point of interest. We
summarize core concepts of graph queries and $\mathcal{L}$ below.

In its plainest form, a graph query is characterized by a pattern $n$. A match
for this query is a binding from $n$ to a queried graph which preserves structure
and type. $\mathcal{L}$ allows for the specification of temporal graph queries,
i.e., queries of the form $(n, \psi)$ with $\psi$ an MTGC over $n$, whereby matches
for $n$ in an RTM$^H$ $H$ need to satisfy the temporal requirement captured in
$\psi$. Based on the running example, the query $(n_1, \neg\psi_P)$ searches $H$
for matches for $n_1$, i.e., sensor services, which falsify $\psi_P$.
Vertices in $H$ have lifespans, defined by their cts and dts. Similarly, a match
$m$ in $H$ is valid only if there is a non-empty interval
$\lambda^m = \bigcap_{e \in E} [e.cts, e.dts)$, with $E$ the vertices of $m$,
called the lifespan of a match. According to its definition, the values of regular
attributes in $H$ cannot change and, hence, cannot affect $\lambda^m$. In the
special case where the pattern of a query is the empty graph $\emptyset$, an
(empty) match $m$ is always found with $\lambda^m = \mathbb{R}$. Temporal logics
that reason over intervals, such as MTGL, are capable of deciding the truth value
of a property for the entire time domain; in INTEMPO, the set of time points
satisfying a property is called the satisfaction span and defined as
$\mathcal{Y}(m, \psi) = \{\tau \mid \tau \in \mathbb{R} \wedge (H, m, \tau) \models \psi\}$
with $\psi$ an MTGC. The temporal validity $\mathcal{V}(m, \psi)$ is equal to
$\lambda^m \cap \mathcal{Y}(m, \psi)$ and defined as the period for which $m$
exists in $H$ and satisfies $\psi$.
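
The lifespan and the temporal validity can be computed by plain interval intersection. The following sketch assumes, for simplicity, that every interval is half-open $[a, b)$ and represented as a pair, and that a satisfaction span is given as a list of such pairs; the function names are hypothetical. The values replay the vertices of $H_{[7]}$ from Fig. 3.

```python
import math

INF = math.inf


def lifespan(vertices):
    """Lifespan of a match: intersection of [cts, dts) over its vertices.

    vertices is a list of (cts, dts) pairs; None denotes an empty lifespan.
    The empty match (no vertices) has the whole time domain as lifespan.
    """
    vertices = list(vertices)
    if not vertices:
        return (-INF, INF)
    lo = max(cts for cts, _ in vertices)
    hi = min(dts for _, dts in vertices)
    return (lo, hi) if lo < hi else None


def temporal_validity(life, span):
    """Intersect the lifespan with a satisfaction span (list of [a, b))."""
    if life is None:
        return []
    result = []
    for lo, hi in span:
        a, b = max(lo, life[0]), min(hi, life[1])
        if a < b:
            result.append((a, b))
    return result


# Match m2 for n1 in H_[7]: s (cts=2) and pm2 (cts=7), both not yet deleted.
life_m2 = lifespan([(2, INF), (7, INF)])
# If the satisfaction span is the whole time domain, the validity is [7, inf).
validity_m2 = temporal_validity(life_m2, [(-INF, INF)])
```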

The following computation, called the satisfaction computation $\mathcal{Z}$ of
$m$ for $\psi$, soundly computes $\mathcal{Y}$, as shown in [49]. The computation
relies on interval operations defined as usual (see [41]): Let $k, z$ be
intervals; then $k \oplus z = [\ell(k) + \ell(z), r(k) + r(z)]$ and
$k \ominus z = [\ell(k) - r(z), r(k) - \ell(z)]$, with $\ell(k)$ and $r(k)$ the
left and right end-point of $k$, respectively. We denote the unions
$\ell(k) \cup k$ by ${}^{+}k$, and $k \cup r(k)$ by $k^{+}$; when
$r(k) = \infty$, $k^{+} = k$. The interval $k$ is overlapping $z$ when
$k \cap z \neq \emptyset$ and adjacent to $z$ when $k \cap z = \emptyset$ but
$k \cup z$ is an interval.
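
These interval operations can be sketched in a few lines. The `Interval` class below is a hypothetical helper, not part of INTEMPO; as an assumption beyond the text, the end-points produced by $\oplus$ and $\ominus$ are treated as closed only if both contributing end-points are closed.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float
    lo_closed: bool = True
    hi_closed: bool = True

    def is_empty(self):
        if self.lo > self.hi:
            return True
        return self.lo == self.hi and not (self.lo_closed and self.hi_closed)

    def intersect(self, z):
        if self.lo != z.lo:
            lo, lc = max((self.lo, self.lo_closed), (z.lo, z.lo_closed))
        else:
            lo, lc = self.lo, self.lo_closed and z.lo_closed
        if self.hi != z.hi:
            hi, hc = min((self.hi, self.hi_closed), (z.hi, z.hi_closed))
        else:
            hi, hc = self.hi, self.hi_closed and z.hi_closed
        return Interval(lo, hi, lc, hc)

    def overlapping(self, z):
        return not self.intersect(z).is_empty()

    def adjacent(self, z):
        """Disjoint, but the union is an interval (end-points touch)."""
        if self.overlapping(z):
            return False
        return ((self.hi == z.lo and (self.hi_closed or z.lo_closed))
                or (z.hi == self.lo and (z.hi_closed or self.lo_closed)))

    def close_left(self):   # +k
        return self if self.lo == -math.inf else Interval(
            self.lo, self.hi, True, self.hi_closed)

    def close_right(self):  # k+
        return self if self.hi == math.inf else Interval(
            self.lo, self.hi, self.lo_closed, True)


def plus(k, z):   # k (+) z
    return Interval(k.lo + z.lo, k.hi + z.hi,
                    k.lo_closed and z.lo_closed, k.hi_closed and z.hi_closed)


def minus(k, z):  # k (-) z
    return Interval(k.lo - z.hi, k.hi - z.lo,
                    k.lo_closed and z.hi_closed, k.hi_closed and z.lo_closed)


j = Interval(1, 2, True, False)   # [1, 2)
i = Interval(2, 4)                # [2, 4]
```

Here `j` and `i` are adjacent but not overlapping, and closing `j` on the right makes their intersection the single point `[2, 2]`.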


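Definition 4 (satisfaction computation $\mathcal{Z}$). Let $n, \hat{n}$ be
patterns and $\psi, \chi, \omega$ be MTGCs. Moreover, let $m$ be a match for $n$ in
an RTM$^H$ $H$, and $\hat{M}$ a set of matches for $\hat{n}$ that are compatible
with the (enclosing) match $m$. The satisfaction computation
$\mathcal{Z}(m, \psi)$ is recursively defined as follows.


$$\mathcal{Z}(m, true) = \mathbb{R} \quad (1)$$

$$\mathcal{Z}(m, \neg\chi) = \mathbb{R} \setminus \mathcal{Z}(m, \chi) \quad (2)$$

$$\mathcal{Z}(m, \chi \wedge \omega) = \mathcal{Z}(m, \chi) \cap \mathcal{Z}(m, \omega) \quad (3)$$

$$\mathcal{Z}(m, \exists(\hat{n}, \chi)) = \bigcup_{\hat{m} \in \hat{M}} \lambda^{\hat{m}} \cap \mathcal{Z}(\hat{m}, \chi) \quad (4)$$

$$\mathcal{Z}(m, \chi\, U_I\, \omega) = \begin{cases} \bigcup_{i \in \mathcal{Z}(m,\omega),\, j \in J_i} j \cap \big((j^{+} \cap i) \ominus I\big) & \text{if } 0 \notin I \\ \bigcup_{i \in \mathcal{Z}(m,\omega)} i \cup \bigcup_{j \in J_i} j \cap \big((j^{+} \cap i) \ominus I\big) & \text{if } 0 \in I \end{cases} \quad (5)$$

$$\mathcal{Z}(m, \chi\, S_I\, \omega) = \begin{cases} \bigcup_{i \in \mathcal{Z}(m,\omega),\, j \in J_i} j \cap \big(({}^{+}j \cap i) \oplus I\big) & \text{if } 0 \notin I \\ \bigcup_{i \in \mathcal{Z}(m,\omega)} i \cup \bigcup_{j \in J_i} j \cap \big(({}^{+}j \cap i) \oplus I\big) & \text{if } 0 \in I \end{cases} \quad (6)$$


with $J_i$ the set of all intervals in $\mathcal{Z}(m, \chi)$ that are either
overlapping or adjacent to some $i \in \mathcal{Z}(m, \omega)$.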


The intuition behind the equations for true, negation, and conjunction is clear.
Regarding exists, the satisfaction span is the union of the temporal validity of
all matches $\hat{m}$ for $\hat{n}$ which are compatible with $m$. Regarding
until, if $0 \notin I$, the satisfaction includes every time point $\tau$ in the
intersection of some $i' \in \mathcal{Z}(m, \omega)$ with a
$j' \in \mathcal{Z}(m, \chi)$ for which a time point $\tau' \in i'$ occurs within
$I$. Furthermore, $j'$ needs to overlap $i'$, e.g., $j' = [1, 3]$, $i' = [2, 4]$,
or be adjacent to $i'$, e.g., $j' = [1, 2)$, $i' = [2, 4]$. If $j'$ and $i'$ are
adjacent, during the computation $j'$ becomes right-closed to ensure that their
intersection produces a non-empty set. If $0 \in I$, then, according to
Definition 3, it may be that $j'$ is empty, i.e., does not exist, and until is
satisfied by every $i' \in \mathcal{Z}(m, \omega)$. Therefore, the computation
includes every $i'$ and remains unchanged otherwise. The intuition behind since is
analogous.
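
The until case can be illustrated with a small sketch. It is a simplification under explicit assumptions: all intervals are closed and represented as pairs, so that $j^{+} = j$ and adjacency coincides with overlapping at a single point; the function names are hypothetical; and the result is returned as a list of intervals without merging overlapping members.

```python
def intersect(a, b):
    """Intersection of two closed intervals, or None if it is empty."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def minus(k, z):
    """k (-) z = [l(k) - r(z), r(k) - l(z)]."""
    return (k[0] - z[1], k[1] - z[0])


def until_span(z_chi, z_omega, I):
    """Sketch of Z(m, chi U_I omega) for lists of closed intervals."""
    result = []
    for i in z_omega:
        if I[0] <= 0 <= I[1]:
            result.append(i)          # 0 in I: every i satisfies until
        for j in z_chi:
            ji = intersect(j, i)      # j+ intersected with i (j is closed)
            if ji is None:
                continue              # j neither overlaps nor touches i
            piece = intersect(j, minus(ji, I))
            if piece is not None:
                result.append(piece)
    return result


# j' = [1, 3] overlaps i' = [2, 4]; with I = [1, 2] until holds on [1, 2].
span_strict = until_span([(1, 3)], [(2, 4)], (1, 2))
# With 0 in I, i' itself is included as well.
span_with_zero = until_span([(1, 3)], [(2, 4)], (0, 2))
```

For `span_strict`, a time point such as 2.5 is excluded: the earliest admissible witness would lie at 3.5, but $\chi$ only holds up to time point 3.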

The intersection of two intervals is always an interval, whereas the union of
two intervals may result in disjoint sets. Hence, technically $\mathcal{Z}$ and
$\mathcal{V}$ are interval sets which may contain disjoint or empty intervals.

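We define below the answer set $\mathcal{T}$ for a query in $\mathcal{L}$.


Definition 5 (query answer set $\mathcal{T}$). Given a pattern $n$, an MTGC
$\psi$, and an RTM$^H$ $H$, the answer set $\mathcal{T}$ of a query in
$\mathcal{L}$ over $H$ is given by:


$$\mathcal{T}(H) = \{(m, \mathcal{V}(m, \psi)) \mid m \text{ is a match for } n \wedge \mathcal{V}(m, \psi) \neq \emptyset\}$$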


Regarding the operation of INTEMPO (see Fig. 1), the approach expects a
metamodel, a set of queries in $\mathcal{L}$, a mapping $\mathcal{E}$ from events
to modifications, and an event trace $h_\tau$ as input (see the definitions
above). INTEMPO operationalizes queries (see Section 5). For each event in
$h_\tau$, INTEMPO performs the corresponding changes to an RTM$^H$ and, after each
change, evaluates the queries. Pruning may follow, which triggers another query
evaluation to update stored matches. Finally, INTEMPO returns the answer set
$\mathcal{T}$ or, for RM, performs the check described in Section 1 and
essentially returns matches in the effective answer set (see Section 5). In our
implementation of INTEMPO, the metamodel, the queries, and the mapping are defined
based on model-based technologies [48].

We present an example that demonstrates that $\mathcal{T}$ may contain imprecise
answers in the context of an incomplete trace.


Example 1 (imprecision over incomplete trace). Evaluated over $H_{[7]}$ in Fig. 3,
the query $(n_1, \neg\psi_P)$ returns an answer set $\mathcal{T}(H_{[7]})$ which
contains a pair $(m_2, [7, \infty))$; $m_2$ is a match for $n_1$ involving the
vertex pm2, and $[7, \infty)$ is the temporal validity $\mathcal{V}$ which states
that $m_2$ falsifies $\psi_P$ from time point 7 onward. $\mathcal{V}$ is the result
of the intersection of $\lambda^{m_2} = [7, \infty)$ with
$\mathcal{Z}(m_2, \neg\psi_P) = \mathbb{R}$. The satisfaction span $\mathcal{Z}$
is computed according to Definition 4; see Table 1 for details.

This computation is definite only if $H_{[7]}$ is the last instance in an
RTM$^H$-trace; if the trace is incomplete, and it is to be continued by a new
$H_{[\tau]}$ with $\tau < 67$, the match $m_2$ may still satisfy $\psi_P$, as there
is still time for a DrugService to be created timely, i.e., for a match for the
pattern $n_{1.2}$, which is compatible with $m_2$, to be found, assuming that until
then there would be no match for $n_{1.1}$.


3 Definite Semantics for Metric Temporal Graph Logic 


This section presents our contribution to MTGL. Specifically, we introduce a new
semantics, called definite, which only returns answers if they are definite, i.e., no
future change to the RTM will affect them. Similarly to temporal logics which
account for RM over incomplete traces [8, 21], the definite semantics is
three-valued, as it returns the value unknown when the result of the satisfaction
check is not definite. We show the soundness of the definite semantics in Theorem 1
based on the regular semantics in Definition 3. Moreover, we show that for a
certain period the definite and the regular semantics are equivalent (Theorem 2);
this equivalence enables our contribution in Section 5, i.e., it allows INTEMPO to
return definite answers efficiently. Finally, we demonstrate an intrinsic limitation
of the definite semantics: we show that for unsatisfiable properties, the semantics
may return decisions with a delay, compared to the earliest time point at which
the decisions could have been returned. We compute the maximum possible
magnitude of the delay (Corollary 2).

We begin with the definition of the definite semantics. In the context of an
RTM$^H$ $H_{[c]}$, a satisfaction decision for time point $\tau \in [0, c]$ is
definite if the decision for $\tau$ remains the same in all possible future
versions of $H_{[c]}$. We obtain the definite satisfaction span by adjusting the
satisfaction relation of MTGL from Definition 3 to this notion of definiteness.
Moreover, we obtain the definite falsification by negating the statements in the
cases of the definite satisfaction. We present the adjusted satisfaction relation,
called definite satisfaction relation, and the definite falsification relation
over an RTM$^H$ below.


Definition 6 (definite satisfaction and definite falsification of metric
temporal graph conditions over an RTM$^H$). Let $H_{[c]}$ be an RTM$^H$, $n$ a
pattern, and $m : n \hookrightarrow H_{[c]}$ a match. Moreover, let
$\tau \in \mathbb{R}$ be a time point and $\psi$ be an MTGC over $n$. Then the
definite satisfaction relation $\models^d$ and definite falsification relation
$\models^d_F$ are defined via mutual recursion as follows. The match $m$
definitely satisfies $\psi$ at $\tau$, written $(H_{[c]}, m, \tau) \models^d \psi$,
iff $\tau \in \lambda^m \cap [0, c]$, or $m$ is the empty match, and one of the
following cases applies.


- $\psi = true$.
- $\psi = \neg\chi$ and $(H_{[c]}, m, \tau) \models^d_F \chi$.
- $\psi = \chi \wedge \omega$, $(H_{[c]}, m, \tau) \models^d \chi$, and
  $(H_{[c]}, m, \tau) \models^d \omega$.
- $\psi = \exists(f : n \hookrightarrow \hat{n}, \chi)$ and there exists
  $\hat{m} : \hat{n} \hookrightarrow H_{[c]}$ such that $\hat{m} \circ f = m$ and
  $(H_{[c]}, \hat{m}, \tau) \models^d \chi$.
- $\psi = \chi\, U_I\, \omega$ and there exists $\tau'$ with $\tau' - \tau \in I$
  such that $(H_{[c]}, m, \tau') \models^d \omega$ and for all
  $\tau'' \in [\tau, \tau')$, $(H_{[c]}, m, \tau'') \models^d \chi$.
- $\psi = \chi\, S_I\, \omega$ and there exists $\tau'$ with $\tau - \tau' \in I$
  such that $(H_{[c]}, m, \tau') \models^d \omega$ and for all
  $\tau'' \in (\tau', \tau]$, $(H_{[c]}, m, \tau'') \models^d \chi$.


The definite falsification relation is based on a logical negation of the
statements in the cases of the definite satisfaction relation. The match $m$
definitely falsifies $\psi$ at $\tau$, written
$(H_{[c]}, m, \tau) \models^d_F \psi$, iff $\tau \in \lambda^m \cap [0, c]$, or $m$
is the empty match, and one of the following cases applies.


- $\psi = \neg\chi$ and $(H_{[c]}, m, \tau) \models^d \chi$.
- $\psi = \chi \wedge \omega$ and $(H_{[c]}, m, \tau) \models^d_F \chi$ or
  $(H_{[c]}, m, \tau) \models^d_F \omega$.
- $\psi = \exists(f : n \hookrightarrow \hat{n}, \chi)$ and either there does not
  exist an $\hat{m} : \hat{n} \hookrightarrow H_{[c]}$ such that
  $\hat{m} \circ f = m$, or there exists $\hat{m}$ and
  $(H_{[c]}, \hat{m}, \tau) \models^d_F \chi$.
- $\psi = \chi\, U_I\, \omega$ and for all $\tau'$ with $\tau' - \tau \in I$,
  $(H_{[c]}, m, \tau') \models^d_F \omega$ or there exists
  $\tau'' \in [\tau, \tau')$ such that $(H_{[c]}, m, \tau'') \models^d_F \chi$.
- $\psi = \chi\, S_I\, \omega$ and for all $\tau'$ with $\tau - \tau' \in I$,
  $(H_{[c]}, m, \tau') \models^d_F \omega$ or there exists
  $\tau'' \in (\tau', \tau]$ such that $(H_{[c]}, m, \tau'') \models^d_F \chi$.


In comparison to $\models$, $\models^d$ confines the lifespans of matches and the
satisfaction of $\exists$ to the period that has been observed, i.e., $[0, c]$.
Moreover, $\models^d$ relies on $\models^d_F$ for the satisfaction of a negation.
Similarly to $\models^d$, $\models^d_F$ confines the decisions for matches to
$[0, c]$, and relies on $\models^d$ for the falsification of negation. The match
$m$ never falsifies true. We note that $\not\models^d$ and $\models^d_F$ are not
equivalent; $\not\models^d$ returns true for time points that do not definitely
satisfy the operator, i.e., points that falsify it but also points for which a
definite decision cannot yet be made.
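
The three-valued character of the definite relations can be illustrated on a deliberately simplified property. The sketch below does not implement Definition 6; it only considers a bounded "eventually" over a list of witness time points, ignores matches and the left operand of until, and uses hypothetical names throughout. It shows why a falsification can only be reported once the whole interval $[\tau, \tau + bound]$ lies within the observed period $[0, c]$.

```python
def definite_verdict(tau, bound, witness_times, c):
    """Verdict at tau for 'a witness occurs within [tau, tau + bound]'.

    witness_times are the time points of observed witnesses (e.g., drug
    service invocations); c is the latest observed time point.
    """
    if tau > c:
        return "unknown"                 # tau itself has not been observed
    horizon = min(tau + bound, c)
    if any(tau <= t <= horizon for t in witness_times):
        return "satisfied"               # no future change can undo this
    if tau + bound <= c:
        return "falsified"               # the whole interval was observed
    return "unknown"                     # a witness may still appear


# Sensor service invoked at time point 7, deadline of 60 seconds.
at_7 = definite_verdict(7, 60, [], c=7)
at_67 = definite_verdict(7, 60, [], c=67)
with_drug = definite_verdict(7, 60, [30], c=30)
```

With the values of the running example, the verdict at time point 7 stays unknown until the observed period reaches time point 67, unless a witness is observed earlier.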

The following theorem shows the soundness of the definite relations $\models^d$
and $\models^d_F$ by relating them to the regular satisfaction relation $\models$
from Definition 3 and its negation $\not\models$. The theorem refers to observed
prefixes of a possibly infinite RTM$^H$-trace $h^H$ and their possible
continuations; an RTM$^H$ $H_{[\tau_i]}$ in $h^H$ is associated with the $\tau_i$
of the event with index $i \in \mathbb{N}^+$ in the execution $h$ (see
Section 2.1). The theorem states that a definite decision, i.e., a decision made
by either $\models^d$ or $\models^d_F$, for a certain time point $\tau$ over an
$H_{[\tau_i]}$ in $h^H$ implies that the same decision is made by $\models$ (or
$\not\models$) for $\tau$ over $H_{[\tau_i]}$; moreover, $\models$ makes the same
decision for $\tau$ over all possible future versions of $H_{[\tau_i]}$ in $h^H$.


Theorem 1 (definite relations imply satisfaction relation over trace).
Let $\psi$ be an MTGC over a pattern $n$. Moreover, let $h^H_{\tau_D}$ be an
RTM$^H$-trace, with $D \in \mathbb{N}^+$. For all $i \in [1, D] \cap \mathbb{N}^+$,
if $m$ is a match for $n$ in $H_{[\tau_i]}$ and $\tau \in [0, \tau_i]$, then for
all $k \in [i, D] \cap \mathbb{N}^+$, (i) if
$(H_{[\tau_i]}, m, \tau) \models^d \psi$, then $(H_{[\tau_k]}, m, \tau) \models \psi$,
and (ii) if $(H_{[\tau_i]}, m, \tau) \models^d_F \psi$, then
$(H_{[\tau_k]}, m, \tau) \not\models \psi$.


Proof (idea). By mutual structural induction over $\psi$. The implication is shown
to hold for each MTGL operator. See Section B.1 for the complete proof.


In the following, we discuss the second important result of this section, i.e., 
the equivalence of the definite and regular semantics. 

The satisfaction decision for future temporal operators at time point $\tau$ may
depend on a $\tau' > \tau$. The upper bound of the distance between $\tau'$ and
$\tau$ is given by the non-definiteness window, defined below.


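Definition 7 (non-definiteness window $w$). Given an MTGC $\psi$, the
non-definiteness window $w$, i.e., the period for which a satisfaction decision
for $\psi$ at a time point $\tau$ may be non-definite, is defined as follows.


$$w(\psi) = \begin{cases} r(I) + \max(w(\chi), w(\omega)) & \text{if } \psi = \chi\, U_I\, \omega \\ \max(w(\chi), w(\omega)) & \text{if } \psi = \chi\, S_I\, \omega \\ \max(w(\chi), w(\omega)) & \text{if } \psi = \chi \wedge \omega \\ w(\chi) & \text{if } \psi = \neg\chi \\ w(\chi) & \text{if } \psi = \exists(\hat{n}, \chi) \\ 0 & \text{if } \psi = true \end{cases} \quad (7)$$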
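
The recursion in Equation (7) translates directly into code. In the sketch below, MTGCs are encoded as nested tuples, which is an encoding chosen here for brevity and not one used by INTEMPO; `window` and the tuple tags are hypothetical names.

```python
import math

TRUE = ("true",)


def window(psi):
    """Non-definiteness window w(psi) following Equation (7).

    Encoding (an assumption of this sketch):
      ("true",), ("not", chi), ("and", chi, omega), ("exists", pattern, chi),
      ("until", (l, r), chi, omega), ("since", (l, r), chi, omega)
    """
    op = psi[0]
    if op == "true":
        return 0
    if op == "not":
        return window(psi[1])
    if op == "exists":
        return window(psi[2])
    if op == "and":
        return max(window(psi[1]), window(psi[2]))
    if op == "since":
        return max(window(psi[2]), window(psi[3]))
    if op == "until":
        right_end = psi[1][1]           # r(I)
        return right_end + max(window(psi[2]), window(psi[3]))
    raise ValueError(f"unknown operator: {op}")


# psi_P = (not exists n1.1) U_[0,60] (exists n1.2)
psi_P = ("until", (0, 60),
         ("not", ("exists", "n1.1", TRUE)),
         ("exists", "n1.2", TRUE))
w_P = window(psi_P)

# An unbounded until yields an infinite window (excluded for online RM).
w_unbounded = window(("until", (0, math.inf), TRUE, TRUE))
```

For the running example, the window equals the right end-point of the until interval, i.e., 60 seconds; nested until operators add up their right end-points.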
As usual in (online) RM, we assume that $w \neq \infty$, i.e., MTGCs contain no
unbounded future operators which may render a property non-monitorable [42].

Based on $w$, we present a variation of Theorem 1 which states that, given an
$H_{[\tau_i]}$, if $\tau \in [0, \tau_i - w]$, with $i$ an index in an
RTM$^H$-trace, then definite decisions made by either the definite satisfaction
relation $\models^d$ or the definite falsification relation $\models^d_F$ are
equivalent to the decisions of the satisfaction relation $\models$. If
$w \neq 0$, in order for $[0, \tau_i - w]$ to be a valid interval, it is implicitly
required that $\tau_i > w$, i.e., $H_{[\tau_i]}$ covers a period that is larger
than the non-definiteness window.


Theorem 2 (definite relations are equivalent to satisfaction relation
over certain period of trace). Let $\psi$ be an MTGC over a pattern $n$ and $w$
the non-definiteness window of $\psi$. Moreover, let $h^H_{\tau_D}$ be an
RTM$^H$-trace, with $D \in \mathbb{N}^+$. For all $i \in [1, D] \cap \mathbb{N}^+$,
if $m$ is a match for $n$ in $H_{[\tau_i]}$ and $\tau \in [0, \tau_i - w]$, then
for all $k \in [i, D] \cap \mathbb{N}^+$, (i)
$(H_{[\tau_i]}, m, \tau) \models^d \psi$ iff $(H_{[\tau_k]}, m, \tau) \models \psi$,
and (ii) $(H_{[\tau_i]}, m, \tau) \models^d_F \psi$ iff
$(H_{[\tau_k]}, m, \tau) \not\models \psi$.


Proof (idea). By mutual structural induction over $\psi$. The equivalence is shown
to hold for each MTGL operator. See Section B.2 for the complete proof.


Theorem 2 enables our contribution to change-driven evaluation in Section 5.

Finally, we present the third important result of the section, i.e., the limitation
of the semantics. The following corollary states that all time points for which a
definite decision cannot be made belong to a certain period in the observed trace.


Corollary 1 (period in trace with non-definite decisions). Let $\psi$ be an
MTGC, $w$ be the non-definiteness window of $\psi$, $H_{[\tau_i]}$ be an RTM$^H$
instance associated with the time point $\tau_i$, $m$ be a match for a pattern
$n$, and $\tau$ a time point in $[0, \tau_i]$. If
$(H_{[\tau_i]}, m, \tau) \not\models^d \psi$ and
$(H_{[\tau_i]}, m, \tau) \not\models^d_F \psi$, then
$\tau \in (\tau_i - w, \tau_i]$.


Proof (idea). Follows from Theorem 2—see Section B.3 for the complete proof. 


We demonstrate below that, in case an MTGC is unsatisfiable (or unfalsifiable),
the definite relations may return an answer with a delay. The maximum possible
delay depends on the non-definiteness window $w$ from Definition 7.

Let $\models^T$ and $\models^T_F$ be respectively a satisfaction and falsification
relation for MTGL that reflect the timeliest knowledge: Given a match $m$, an MTGC
$\psi$, an RTM$^H$ instance $H_{[\tau_i]}$ from a sequence of instances, and a time
point $\tau \in [0, \tau_i]$, $(H_{[\tau_i]}, m, \tau) \models^T \psi$ if
$(H_{[\tau_i]}, m, \tau) \models \psi$ and there exists no possible successor of
$H_{[\tau_i]}$ in the sequence that could falsify $\psi$ at $\tau$; analogously,
$(H_{[\tau_i]}, m, \tau) \models^T_F \psi$ if
$(H_{[\tau_i]}, m, \tau) \not\models \psi$ and there exists no possible successor
of $H_{[\tau_i]}$ that could satisfy $\psi$ at $\tau$. These timeliest relations
can only make decisions for $m$ over the observed trace, as $m$ may not exist in
the parts covered by successors of $H_{[\tau_i]}$, i.e., in time points larger than
$\tau_i$.

Given a sequence of RTM$^H$ instances $h^H$ with $H_{[\tau_i]}$ an instance in
$h^H$, let $H_{[\tau_k]}$ be the first successor of $H_{[\tau_i]}$ in $h^H$ for
which $\tau_k > \tau_i + w$. The following corollary states that, contrary to
$\models^T$ and $\models^T_F$, the definite relations may have to wait for
$H_{[\tau_k]}$ to be able to make a definite decision for
$\tau \in (\tau_i - w, \tau_i]$.
Corollary 2 (maximum possible delay before definite decision). Let $\psi$
be an MTGC, $w$ be the non-definiteness window of $\psi$, $m$ be a match for a
pattern $n$, and $H_{[\tau_i]}$ be an RTM$^H$ instance from a sequence of RTM$^H$
instances $h^H_{\tau_D}$, with $i \in [1, D] \cap \mathbb{N}^+$. Moreover, let
$\tau \in (\tau_i - w, \tau_i]$ and $k$ be the smallest index in
$[i, D] \cap \mathbb{N}^+$ such that $\tau_k > \tau_i + w$. If
$(H_{[\tau_i]}, m, \tau) \not\models^d \psi$ and
$(H_{[\tau_i]}, m, \tau) \not\models^d_F \psi$, then a definite decision for
$\tau$ can be made over $H_{[\tau_k]}$.


Proof. Follows from Corollary 1. 


Thus, compared to $\models^T$ and $\models^T_F$, the definite relations may make a
decision for $\tau \in (\tau_i - w, \tau_i]$ with a delay of at most
$(\tau_k - \tau_i)$ time points.


Example 2 (delay in definite decision). Let
$\psi_c = \Diamond_{[0,1]}(\neg\exists n_1 \wedge \exists n_1)$. Consider an
RTM$^H$-trace comprising two RTM$^H$ instances: $H_{[7]}$ in Fig. 3 and a
hypothetical $H_{[9]}$ which is yielded by an unrelated change and in which all
elements from $H_{[7]}$ are unchanged. Therefore, a match $m_1$ exists in both
instances. The check $(H_{[7]}, m_1, 7) \models^T_F \psi_c$ returns true, as
$(H_{[7]}, m_1, 7) \not\models \psi_c$ and there is no possible successor of
$H_{[7]}$ that could satisfy $\psi_c$; on the other hand,
$(H_{[7]}, m_1, 7) \models^d_F \psi_c$ makes no decision, as according to its
definition, the relation waits first for a duration of history that covers the
timing constraint of until to be observed. The check
$(H_{[9]}, m_1, 7) \models^d_F \psi_c$ returns true, as enough time has elapsed.
Thus, compared to $\models^T_F$, this decision has been made with a delay of two
time points.
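
The bound on the delay can be computed from the time points of the instances in a trace. The sketch below follows the condition $\tau_k > \tau_i + w$ as stated above; the function names and the representation of a trace as a sorted list of time points (with 0-based indices) are assumptions of this illustration.

```python
def first_definite_index(time_points, i, w):
    """Smallest index k >= i with time_points[k] > time_points[i] + w.

    Returns None if the observed trace contains no such instance yet.
    """
    for k in range(i, len(time_points)):
        if time_points[k] > time_points[i] + w:
            return k
    return None


def max_delay(time_points, i, w):
    """Upper bound (tau_k - tau_i) on the delay of a definite decision."""
    k = first_definite_index(time_points, i, w)
    return None if k is None else time_points[k] - time_points[i]


# Example 2: instances H_[7] and H_[9]; the window of psi_c is r([0, 1]) = 1.
delay = max_delay([7, 9], i=0, w=1)
```

For the trace of Example 2, the bound evaluates to two time points, matching the delay observed for the check over $H_{[9]}$.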


Avoiding this delay would require that the definite relations recognize whether
an MTGC is satisfiable, which is undecidable for NGCs and thus for MTGCs. The
delay is not observed with the running example, i.e.,
$\psi_P = \neg\exists n_{1.1}\, U_{[0,60]}\, \exists n_{1.2}$, or similar MTGCs,
e.g., $(\Diamond_{[0,2]} \exists n_{1.1}) \wedge (\Diamond_{[0,3]} \exists n_{1.2})$.


4 Computations and Answer Set for Definite Semantics 


This section presents our contribution to the semantics of £, the query language 
of INTEMPO. Specifically, we adjust the satisfaction computation presented in 
Definition 4 to the definite satisfaction relation (=°) from Definition 6. Moreover, 
we introduce the analogous concepts for the definite falsification relation (H4). 
Theorem 3 shows the soundness of the introduced computations. Based on these 
computations, we introduce a definite answer set for £. 

In the context of a temporal query (n, yY) the definite satisfaction span related 
to a match m for n in Hj, is defined similarly to the satisfaction span Y in 
Section 2.3, i.e., Y? = {r|r E RA (Hig m, T) = y}. The definite falsification 
span is defined as F? = {r|r € RA (Hiq, m, T) 4 y}. Any time point in the 
time domain not in Yf or F belongs to the unknown span X. The sets Yt, F9, 
and X are disjoint. It also holds that R = Y4 w F? w X. The definite satisfaction 
computation Z? and the definite falsification computation F? for an MTGC are 
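defined below.

Definition 8 (definite satisfaction computation $\mathcal{Z}^d$ and definite falsification
computation $F^d$). Let $n$, $\hat{n}$ be patterns and $\chi$, $\omega$ be MTGCs. Moreover,
let $m$ be a match for $n$ in an RTM$^H$ $H_{[\tau]}$ and $\hat{M}$ a set of matches for $\hat{n}$ that are
compatible with the (enclosing) match $m$. The definite satisfaction computation
$\mathcal{Z}^d(m, \psi)$ and definite falsification computation $F^d(m, \psi)$ are defined via mutual
recursion as follows.

$\mathcal{Z}^d(m, \mathit{true}) = \mathbb{R}$    (8)

$\mathcal{Z}^d(m, \neg\chi) = F^d(m, \chi)$    (9)

$\mathcal{Z}^d(m, \chi \wedge \omega) = \mathcal{Z}^d(m, \chi) \cap \mathcal{Z}^d(m, \omega)$    (10)

$\mathcal{Z}^d(m, \exists(\hat{n}, \chi)) = (-\infty, \tau] \cap \bigcup_{\hat{m} \in \hat{M}} \lambda^{\hat{m}} \cap \mathcal{Z}^d(\hat{m}, \chi)$    (11)

$\mathcal{Z}^d(m, \chi\, U_I\, \omega) = \begin{cases} \bigcup_{i \in \mathcal{Z}^d(m,\omega),\, j \in J_i} j \cap ((j^{+} \cap i) \ominus I) & \text{if } 0 \notin I \\ \bigcup_{i \in \mathcal{Z}^d(m,\omega)} \big( i \cup \bigcup_{j \in J_i} j \cap ((j^{+} \cap i) \ominus I) \big) & \text{if } 0 \in I \end{cases}$    (12)

$\mathcal{Z}^d(m, \chi\, S_I\, \omega) = \begin{cases} \bigcup_{i \in \mathcal{Z}^d(m,\omega),\, j \in J_i} j \cap (({}^{+}j \cap i) \oplus I) & \text{if } 0 \notin I \\ \bigcup_{i \in \mathcal{Z}^d(m,\omega)} \big( i \cup \bigcup_{j \in J_i} j \cap (({}^{+}j \cap i) \oplus I) \big) & \text{if } 0 \in I \end{cases}$    (13)

with $J_i$ the set of all intervals in $\mathcal{Z}^d(m, \chi)$ that are either overlapping or adjacent
to some $i \in \mathcal{Z}^d(m, \omega)$.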

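Based on $\mathbb{R} = \mathcal{Y}^d \uplus \mathcal{F}^d \uplus \mathcal{X}$, the definite falsification computation $F^d(m, \psi)$ can
be generally defined as $F^d = \mathbb{R} \setminus (\mathcal{Z}^d \uplus \mathcal{X})$, which leads to the following equations.

$F^d(m, \mathit{true}) = \emptyset$    (14)

$F^d(m, \neg\chi) = \mathcal{Z}^d(m, \chi)$    (15)

$F^d(m, \chi \wedge \omega) = F^d(m, \chi) \cup F^d(m, \omega)$    (16)

$F^d(m, \exists(\hat{n}, \chi)) = (-\infty, \tau] \cap (\mathbb{R} \setminus \mathcal{Z}^d(m, \exists(\hat{n}, \chi)))$    (17)

$F^d(m, \chi\, U_I\, \omega) = \begin{cases} \mathbb{R} \setminus \big( \bigcup_{i \in \mathcal{Z}^d(m,\omega) \uplus \mathcal{X}(m,\omega),\, j \in J_i} j \cap ((j^{+} \cap i) \ominus I) \big) & \text{if } 0 \notin I \\ \mathbb{R} \setminus \big( \bigcup_{i \in \mathcal{Z}^d(m,\omega) \uplus \mathcal{X}(m,\omega)} ( i \cup \bigcup_{j \in J_i} j \cap ((j^{+} \cap i) \ominus I) ) \big) & \text{if } 0 \in I \end{cases}$    (18)

$F^d(m, \chi\, S_I\, \omega) = \begin{cases} \mathbb{R} \setminus \big( \bigcup_{i \in \mathcal{Z}^d(m,\omega) \uplus \mathcal{X}(m,\omega),\, j \in J_i} j \cap (({}^{+}j \cap i) \oplus I) \big) & \text{if } 0 \notin I \\ \mathbb{R} \setminus \big( \bigcup_{i \in \mathcal{Z}^d(m,\omega) \uplus \mathcal{X}(m,\omega)} ( i \cup \bigcup_{j \in J_i} j \cap (({}^{+}j \cap i) \oplus I) ) \big) & \text{if } 0 \in I \end{cases}$    (19)

where $J_i$ is the set of all intervals in $\mathcal{Z}^d(m, \chi) \uplus \mathcal{X}(m, \chi)$ that are either overlapping
or adjacent to some $i \in \mathcal{Z}^d(m, \omega) \uplus \mathcal{X}(m, \omega)$.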


Regarding $\mathcal{Z}^d$, the equations for conjunction, until, and since have the same
structure as their corresponding equations in Definition 4, but rely on $\mathcal{Z}^d$
instead of $\mathcal{Z}$. Analogously to $\models^{d}$, the computation for negation relies on $F^d$. The
computation for exists confines its decisions to the period that has been observed.

Regarding $F^d$, a match $m$ never falsifies true; analogously to $\models^{d}_{F}$, $F^d$ relies on
$\mathcal{Z}^d$ for the falsification of negation; the operator exists confines its computation to
the observed period; the equations for until and since complement their respective
definite satisfaction computations, with one difference: instead of considering only
time points that definitely satisfy the operands $\chi$ and $\omega$, i.e., $\mathcal{Z}^d(m, \chi)$ and
$\mathcal{Z}^d(m, \omega)$, they consider time points that do not definitely falsify $\chi$ and $\omega$, i.e.,
$\mathcal{Z}^d(m, \chi) \uplus \mathcal{X}(m, \chi)$ and $\mathcal{Z}^d(m, \omega) \uplus \mathcal{X}(m, \omega)$.
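The mutual recursion between the two computations can be illustrated for the Boolean fragment, i.e., equations (8)-(10) and (14)-(16). The following Python sketch is not part of INTEMPO; it assumes a small discrete time domain in place of $\mathbb{R}$, and the class names as well as the leaf `Atom`, whose definite spans are supplied directly instead of being computed from matches, are hypothetical.

```python
from dataclasses import dataclass

# Assumption: a finite set of integer time points stands in for the real line.
DOMAIN = frozenset(range(0, 11))


@dataclass(frozen=True)
class TrueC:
    pass


@dataclass(frozen=True)
class Atom:
    # Hypothetical leaf: its definite spans are given, not computed.
    sat: frozenset
    fal: frozenset


@dataclass(frozen=True)
class Not:
    arg: object


@dataclass(frozen=True)
class And:
    left: object
    right: object


def z_d(f):
    """Definite satisfaction computation for true, negation, conjunction."""
    if isinstance(f, TrueC):
        return DOMAIN
    if isinstance(f, Atom):
        return f.sat
    if isinstance(f, Not):
        return f_d(f.arg)  # negation is satisfied where the operand is falsified
    if isinstance(f, And):
        return z_d(f.left) & z_d(f.right)
    raise TypeError(f"unsupported condition: {f!r}")


def f_d(f):
    """Definite falsification computation for true, negation, conjunction."""
    if isinstance(f, TrueC):
        return frozenset()
    if isinstance(f, Atom):
        return f.fal
    if isinstance(f, Not):
        return z_d(f.arg)  # negation is falsified where the operand is satisfied
    if isinstance(f, And):
        return f_d(f.left) | f_d(f.right)
    raise TypeError(f"unsupported condition: {f!r}")


def unknown(f):
    """Time points that are neither definitely satisfied nor falsified."""
    return DOMAIN - z_d(f) - f_d(f)


# History observed up to time point 7: later time points stay undecided.
a = Atom(sat=frozenset(), fal=frozenset(range(0, 8)))
b = Atom(sat=frozenset({5, 6}), fal=frozenset({0, 1, 2, 3, 4, 7}))
phi = And(Not(a), b)
```

In this sketch, the three sets returned by `z_d`, `f_d`, and `unknown` partition the domain, mirroring the disjoint union of the definite satisfaction, definite falsification, and unknown spans.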

The following theorem states that the sets of time points in the definite
satisfaction span $\mathcal{Y}^d$ and definite falsification span $\mathcal{F}^d$ are equal to the sets of
time points obtained by the definite satisfaction computation $\mathcal{Z}^d$ and definite
falsification computation $F^d$, respectively.


Theorem 3 (equality of definite spans and definite computations for
satisfaction and falsification). Given a match $m$ in an RTM$^H$ $H_{[\tau]}$ and an
MTGC $\psi$, it holds that $\mathcal{Y}^d(m, \psi) = \mathcal{Z}^d(m, \psi)$ and $\mathcal{F}^d(m, \psi) = F^d(m, \psi)$.


Proof (idea). The proof for $\mathcal{Z}^d$ proceeds by structural induction over $\psi$. The proof
for $F^d$ is based on the application of $F^d = \mathbb{R} \setminus (\mathcal{Z}^d \uplus \mathcal{X})$ for each MTGL operator.
See Section B.4 for the complete proof.


Based on the definite computations, we now extend $\mathcal{L}$ with a notion of definite
answers by adjusting the answer set $\mathcal{T}$ in Definition 5. To this end, we define
the notion of temporal invalidity $\mathcal{IV}$ as the dual notion of temporal validity $\mathcal{V}$
from Section 2.3, i.e., the intersection of the lifespan $\lambda^m$ of a match $m$ with
the falsification span. Moreover, we define the definite temporal validity $\mathcal{V}^d$ as
$\lambda^m \cap \mathcal{Z}^d$, and the definite temporal invalidity $\mathcal{IV}^d$ as $\lambda^m \cap F^d$.


Definition 9 (definite answer set $\mathcal{T}^d$). Given a pattern $n$, an MTGC $\psi$, and
an RTM$^H$ $H$, the definite answer set $\mathcal{T}^d$ of a query in $\mathcal{L}$ over $H$ is given by:


$\mathcal{T}^d(H) = \{(m, \mathcal{V}^d(m, \psi), \mathcal{IV}^d(m, \psi)) \mid m \text{ is a match for } n \wedge (\mathcal{V}^d \neq \emptyset \vee \mathcal{IV}^d \neq \emptyset)\}$
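Definition 9 can be read operationally: intersect the lifespan of each match with its definite computations and keep the match if either result is non-empty. The following Python sketch illustrates this reading under stated assumptions; the `Interval` class and the function names are hypothetical and not the INTEMPO API. The spans for `m2` and `m1` are those reported for the query condition in the running example at time point 7, whereas the lifespan of `m1` is an assumed value chosen for illustration.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float
    lo_closed: bool = True
    hi_closed: bool = True

    def is_empty(self):
        if self.lo > self.hi:
            return True
        return self.lo == self.hi and not (self.lo_closed and self.hi_closed)


def intersect(a, b):
    """Intersection of two intervals, or None if it is empty."""
    if a.lo > b.lo:
        lo, lo_closed = a.lo, a.lo_closed
    elif a.lo < b.lo:
        lo, lo_closed = b.lo, b.lo_closed
    else:
        lo, lo_closed = a.lo, a.lo_closed and b.lo_closed
    if a.hi < b.hi:
        hi, hi_closed = a.hi, a.hi_closed
    elif a.hi > b.hi:
        hi, hi_closed = b.hi, b.hi_closed
    else:
        hi, hi_closed = a.hi, a.hi_closed and b.hi_closed
    result = Interval(lo, hi, lo_closed, hi_closed)
    return None if result.is_empty() else result


def restrict(lifespan, span):
    """Intersect a lifespan with every interval of an interval set."""
    parts = (intersect(lifespan, iv) for iv in span)
    return [p for p in parts if p is not None]


def definite_answer_set(matches):
    """matches maps a match name to (lifespan, Z^d, F^d)."""
    answers = {}
    for name, (lifespan, z_d, f_d) in matches.items():
        v_d = restrict(lifespan, z_d)    # definite temporal validity
        iv_d = restrict(lifespan, f_d)   # definite temporal invalidity
        if v_d or iv_d:
            answers[name] = (v_d, iv_d)
    return answers


NEG_INF, INF = -math.inf, math.inf
matches = {
    # Lifespan [3, inf) of m1 is an assumption made for this illustration.
    "m1": (Interval(3, INF, True, False),
           [Interval(NEG_INF, -55, False, False), Interval(7, 7)],
           [Interval(-55, 7, True, False)]),
    # m2: lifespan [7, inf), Z^d = (-inf, -53], F^d empty.
    "m2": (Interval(7, INF, True, False),
           [Interval(NEG_INF, -53, False, True)],
           []),
}
answers = definite_answer_set(matches)
```

With these inputs, `m2` is excluded because both intersections are empty, whereas `m1` is kept with a definite temporal validity of [7, 7].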


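Example 8 (precision of definite computations over incomplete trace). As in
Example 1, the query $(n_1, \neg\psi_P)$ is evaluated over $H_{[7]}$. This time, however, we
obtain the definite answer set $\mathcal{T}^d(H_{[7]})$. The match $m_2$ for $n_1$, which involves the
object pm2, is not contained in $\mathcal{T}^d$: $m_2$ is matched and its lifespan is computed
to be $\lambda^{m_2} = [7, \infty)$, but no compatible match for $n_{1.2}$ is found. As shown in
Table 1, $\mathcal{Z}^d(m_2, \neg\psi_P) = (-\infty, -53]$ and $F^d(m_2, \neg\psi_P) = \emptyset$. Therefore, both $\mathcal{V}^d$ and
$\mathcal{IV}^d$ are empty, and the match is excluded from $\mathcal{T}^d$. Note that $\mathcal{T}^d(H_{[7]})$ contains
a match $m_1$ for $n_1$ that involves pm1, as its $\mathcal{V}^d$ is non-empty (see Table 1), i.e.,
there are time points for which $m_1$ definitely falsifies $\psi_P$ or, equivalently, definitely
satisfies $\neg\psi_P$. All computations in Table 1 are interval sets (see Section 2.3); however, for
presentation purposes, singletons are displayed as intervals.

Table 1: Computations $\mathcal{Z}$, $\mathcal{Z}^d$, and $F^d$ for two matches for $(n_1, \neg\psi_P)$ over $H_{[7]}$.

MTGC | $m_1$: $\mathcal{Z}$ | $m_1$: $\mathcal{Z}^d$ | $m_1$: $F^d$ | $m_2$: $\mathcal{Z}$ | $m_2$: $\mathcal{Z}^d$ | $m_2$: $F^d$
true | $\mathbb{R}$ | $\mathbb{R}$ | $\emptyset$ | $\mathbb{R}$ | $\mathbb{R}$ | $\emptyset$
$\exists n_{1.1}$ | $\emptyset$ | $\emptyset$ | $(-\infty, 7]$ | $\emptyset$ | $\emptyset$ | $(-\infty, 7]$
$\neg\exists n_{1.1}$ | $\mathbb{R}$ | $(-\infty, 7]$ | $\emptyset$ | $\mathbb{R}$ | $(-\infty, 7]$ | $\emptyset$
true | $\mathbb{R}$ | $\mathbb{R}$ | $\emptyset$ | $\mathbb{R}$ | $\mathbb{R}$ | $\emptyset$
$\exists n_{1.2}$ | $[5, 7)$ | $[5, 7)$ | $\{(-\infty, 5), [7, 7]\}$ | $\emptyset$ | $\emptyset$ | $(-\infty, 7]$
$\psi_P$ | $[-55, 7)$ | $[-55, 7)$ | $\{(-\infty, -55), [7, 7]\}$ | $\emptyset$ | $\emptyset$ | $(-\infty, -53]$
$\neg\psi_P$ | $\{(-\infty, -55), [7, \infty)\}$ | $\{(-\infty, -55), [7, 7]\}$ | $[-55, 7)$ | $\mathbb{R}$ | $(-\infty, -53]$ | $\emptyset$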

Let $H_{[67]}$ be an RTM$^H$ that is yielded by an event at time point 67; the changes
by this event do not affect vertices or nodes in $H_{[7]}$. Then, $m_2$ would be returned by $\mathcal{T}^d$,
paired with $\mathcal{V}^d = [7, 7]$, as there would be no future version of the RTM$^H$ which
could satisfy $\psi_P$ at time point 7.


5 Keeping to Change-driven Evaluation 


The operationalization of queries in INTEMPO (see also Fig. 1) is based on
Generalized Discrimination Networks (GDNs) [28, 10]. Specifically, a query in $\mathcal{L}$
is decomposed into a suitable ordering, i.e., a network, $N$ of simple sub-queries. $N$
is a tree where each node represents a query and each edge a dependency between
queries; see Fig. 2 (right) for the GDN for $\psi_P$. $N$ is executed bottom-up, i.e., the
execution starts with the leaves and proceeds upward. The root of $N$ computes the
answer set $\mathcal{T}(H)$ of the query. Each node in $N$ stores intermediate matches paired with
their $\mathcal{Z}$; therefore, $N$ is amenable to change-driven and incremental execution:
changes to $H$ are propagated through $N$, whose nodes only recompute their
stored matches if the change is relevant to them or one of their dependencies.
Moreover, INTEMPO offers a method to remove temporally irrelevant history
from the RTM, thereby rendering the query evaluation memory-efficient.

Based on these features, an extensive experimental evaluation of our
implementation of INTEMPO showed efficient performance in the evaluation of
temporal graph queries over considerably large models (approximately from
10K to 48M elements) [49]. INTEMPO also evaluated queries faster than the
established RV tool MONPOLY [6] as well as the RTM-based tool HAWK [24] in
an RM application scenario. In the scenario, incomplete traces were handled by
performing a check for each match which, based on the timing constraints of the
property, postponed returning the match if future changes could affect it.

The definite answer set $\mathcal{T}^d$ from Definition 9 handles incomplete traces
comprehensively, as it only includes matches and time points which no future change
can affect. However, $\mathcal{T}^d$ relies on the definite MTGL semantics from Definition 6
which, contrary to the regular semantics from Definition 3, considers the time
point on which a query is evaluated; consequently, adjusting $N$ to compute the
definite computations $\mathcal{Z}^d$ and $F^d$, and thus to return $\mathcal{T}^d$, would imply that every
new version of $H_{[\tau]}$ would trigger a re-computation of all spans stored in $N$.
Therefore, $\mathcal{T}^d$ is not amenable to change-driven evaluation.

Based on the intuition behind the check from above, we lastly present a new
answer set, called effective, that contains definite results while relying on $\mathcal{T}$, which
is amenable to change-driven evaluation. Specifically, based on the equivalence
in Theorem 2, we show that $\mathcal{T}$ is equivalent to a subset of $\mathcal{T}^d$ if the $\mathcal{V}$ of matches
in $\mathcal{T}$ is restricted to a period with definite decisions (see Corollary 1). This last
contribution formalizes the intuition behind the check from above, and allows
approaches like INTEMPO to maintain their efficiency while returning sound
results. We define the effective answer set $\mathcal{T}^e$ for $\mathcal{L}$ based on $\mathcal{T}$ below.


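Definition 10 (effective answer set $\mathcal{T}^e$). Given a pattern $n$, an MTGC $\psi$
with $w$ the non-definiteness window of $\psi$, an RTM$^H$ $H_{[\tau]}$, and an answer set
$\mathcal{T}(H_{[\tau]})$ of a query in $\mathcal{L}$, the effective answer set $\mathcal{T}^e(H_{[\tau]})$ of the query is the
set of all tuples $(m, \mathcal{V} \cap [0, \tau - w])$ such that (i) $(m, \mathcal{V}(m, \psi)) \in \mathcal{T}(H_{[\tau]})$ and (ii)
$\mathcal{V}(m, \psi) \cap [0, \tau - w] \neq \emptyset$.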
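Operationally, Definition 10 caps the temporal validity of every match in the regular answer set at $\tau - w$ and drops matches whose capped validity is empty. The following Python sketch illustrates this under simplifying assumptions: intervals are closed pairs `(lo, hi)`, the function names are hypothetical, and the input answer set is an illustrative value rather than the output of an actual query.

```python
import math


def cap(validity, tau, w):
    """Intersect each closed interval (lo, hi) with [0, tau - w]."""
    bound = tau - w
    capped = []
    for lo, hi in validity:
        new_lo, new_hi = max(lo, 0.0), min(hi, float(bound))
        if new_lo <= new_hi:
            capped.append((new_lo, new_hi))
    return capped


def effective_answer_set(answer_set, tau, w):
    """answer_set maps a match name to its temporal validity (interval list)."""
    result = {}
    for match, validity in answer_set.items():
        capped = cap(validity, tau, w)
        if capped:  # condition (ii): the capped validity must be non-empty
            result[match] = capped
    return result


# Hypothetical regular answer set: the validity of m1 starts at time point 5.
regular = {"m1": [(5.0, math.inf)]}

at_5 = effective_answer_set(regular, tau=5, w=2)
at_7 = effective_answer_set(regular, tau=7, w=2)
```

With a window of 2, the match is withheld at time point 5 and admitted with validity [5, 5] once the trace has reached time point 7, which is the kind of delay the effective answer set trades for change-driven evaluation.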


The following theorem states that $\mathcal{T}^e$ is equal to a restricted version of $\mathcal{T}^d$
whose $\mathcal{V}^d$ excludes a period equal to $w$. We assume that the trace duration is
larger than $w$ and that the trace has more than one member.


Theorem 4 (equality of effective answer set and restricted definite
temporal validity answer set over trace). Let $(n, \psi)$ be a query with $\psi$
an MTGC, $w$ be the non-definiteness window of $\psi$, and $h^H_{\tau}$ be an RTM$^H$-trace
with $D \in [2, \infty] \cap \mathbb{N}^{+}$, and $i$ be an index in $[k, D - 1] \cap \mathbb{N}^{+}$ such that $\tau_k > w$.
Moreover, let $\mathcal{T}^d_{\mathcal{V}}(H_{[\tau_i]})$ be the restricted definite temporal validity answer set
over $H_{[\tau_i]}$, which has been obtained from the definite answer set $\mathcal{T}^d$ but (i) contains
only pairs of matches with their temporal validity $\mathcal{V}^d$, with $\mathcal{V}^d \neq \emptyset$, and (ii) has $\mathcal{V}^d$
intersected with $[0, \tau_i - w]$. Then, $\mathcal{T}^e(H_{[\tau_i]}) = \mathcal{T}^d_{\mathcal{V}}(H_{[\tau_i]})$.


Proof (idea). Based on the more general Theorem 2. See Section B.5 for the 
complete proof. 


Theorem 4 shows how INTEMPO returns definite results while using the
change-driven evaluation for $\mathcal{T}$ described above. On the other hand, as $\mathcal{T}^d_{\mathcal{V}}$ excludes
$F^d$, obtaining $F^d$ with $\mathcal{T}^e$ requires the evaluation of a separate query $(n, \neg\psi)$ in
parallel to $(n, \psi)$. Moreover, due to postponing returning answers that may be
non-definite, $\mathcal{T}^e$ may return answers with a delay; although this is not observed
in $\psi_P$ from the running example, it may affect other properties, as demonstrated
in Example 4. Hence, $\mathcal{T}^e$ is intended for application scenarios where this impact
is either absent or acceptable.

Example 4 (Delay in detection). Let $\psi_P = (\neg\exists n_{1.1}) \wedge (\Diamond_{[0,2]} \exists n_{1.2})$ be an MTGC
and $(n_1, \neg\psi_P)$ a query in $\mathcal{L}$. Let $H_{[5]}$ be a hypothetical RTM$^H$ that contains a
match $m_1$ for $n_1$ and a match for $n_{1.1}$, whose lifespans are $[5, \infty)$. The time point 5
is contained in $\mathcal{V}^d(m_1, \neg\psi_P)$, i.e., the decision for 5 is definite; however, this time
point is not admitted to $\mathcal{T}^e(H_{[5]})$ due to the intersection with $[0, 5 - w]$, where,
for $\psi_P$, $w = 2$. The time point will be admitted to $\mathcal{T}^e$ when $w$ has elapsed.
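The value of $w$ can be obtained by a recursive traversal of the MTGC. The following Python sketch uses the rules that the proof of Theorem 2 quotes for true, negation, exists, and until; the rule for conjunction (maximum of the operands) and the encoding of the eventually operator as `true U_I ...` are assumptions made here, since Definition 7 is not restated in this section. All class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrueC:
    pass


@dataclass(frozen=True)
class Not:
    arg: object


@dataclass(frozen=True)
class And:
    left: object
    right: object


@dataclass(frozen=True)
class Exists:
    pattern: str
    arg: object


@dataclass(frozen=True)
class Until:
    left: object
    lo: float   # left end of the timing interval I
    hi: float   # right end r(I) of the timing interval I
    right: object


def window(f):
    """Non-definiteness window of a condition, computed recursively."""
    if isinstance(f, TrueC):
        return 0
    if isinstance(f, Not):
        return window(f.arg)
    if isinstance(f, Exists):
        return window(f.arg)
    if isinstance(f, And):
        # Assumption: conjunction takes the larger window of its operands.
        return max(window(f.left), window(f.right))
    if isinstance(f, Until):
        return max(window(f.left), window(f.right)) + f.hi
    raise TypeError(f"unsupported condition: {f!r}")


def eventually(lo, hi, arg):
    """Assumed abbreviation: eventually_I arg == true U_I arg."""
    return Until(TrueC(), lo, hi, arg)


running = Until(Not(Exists("n1.1", TrueC())), 0, 60, Exists("n1.2", TrueC()))
delayed = And(Not(Exists("n1.1", TrueC())),
              eventually(0, 2, Exists("n1.2", TrueC())))
print(window(running), window(delayed))  # 60 2
```

Under these assumptions, the condition of Example 4 yields a window of 2, in line with the value stated in the example.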


6 Related Work 


In our previous work, we presented an analysis procedure with preliminary support 
for RM of MTGL, as the procedure can be adjusted so that it returns true either 
as soon as a falsification is detected or only when it has become definite [51]. 
When a falsification is detected, the procedure returns the time point on which 
the procedure was last executed. The result abstracts the interval-based semantics 
of MTGL into a point-based interpretation which lacks precision. The definite 
semantics from Section 3 supports RM of MTGL directly, i.e., at the level of 
semantics. Moreover, it enables the computations of the definite falsification and 
satisfaction spans, which in turn enable practical query evaluations. 

Compared to INTEMPO and its advancement we presented, other query-based 
approaches for RM over structural RTMs either lack a formal treatment of 
monitoring, e.g., [24, 1], or do not support other key features, e.g., first-order 
quantification [19], temporal operators [14, 13], or timing constraints [40]. On the 
other hand, these approaches have their own advantages over the foundations we 
presented, e.g., support for distributed query evaluation [14] and more temporal 
primitives [24]. 

Runtime Verification (RV) is also concerned with formally precise online 
RM over incrementally processed, and thus possibly incomplete, traces of events. 
Despite the similarity of their aim, RV and RTMs are different in their applications 
and characteristics: for instance, state representations in RV focus on a low level 
of abstraction and are typically inaccessible during monitoring. Conversely, an 
RTM aims at a richer knowledge representation [14] and has to be accessible to 
end-users or other technologies during monitoring, as it acts as an interface to 
manage the system [23]—see [47, 49] for a more elaborate comparison. In RV, 
properties may be specified using various formalisms, e.g., temporal logics and 
regular expressions [3], comparisons among which are non-trivial [33, 43]. In the 
following, we focus on approaches based on temporal logics. According to a recent 
classification, no approach simultaneously supports key features of INTEMPO 
such as first-order quantification, metric temporal constraints, interval-based 
interpretations, and native support for graph queries and bindings [22]. 

The RV approach most relevant to our work is MONPOLY [6]. MONPOLY, an
established tool that has been among the top performers in an RV competition [2],
is an implementation of an incremental monitoring algorithm based on Metric
First-Order Temporal Logic (MFOTL) [7]. The semantics of MFOTL is
point-based, i.e., the logic assesses the truth of a formula only for the time points
of events in a trace, which means the logic cannot support the computation
of a temporal validity or represent the lifespan of a match straightforwardly.
MONPOLY cannot always encode complex graph queries: for instance, expressing
the MTGC from the running example, which prohibits the existence of a pattern,
is not possible, as MONPOLY restricts the use of negation at this place in the
formula for reasons of monitorability. Even when possible, this encoding may
become overly technical and, as indicated by the performance comparison of
INTEMPO to MONPOLY [49] as well as another similar comparison [19], may
affect performance: for instance, emulating graph pattern matching requires that
partial orderings of match candidates are explicitly formulated in MFOTL, which
may bloat the size of the formula.

The RV tool DEJAVU [31, 30] monitors properties specified in a first-order
metric past-only logic with point-based semantics. Translating MTGCs into this
logic would require emulating graph-based encodings and bindings (similar to
MONPOLY) and, moreover, reformulating MTGCs such that they feature only past
operators. Such reformulations are not always possible and could be significantly
less compact [37, 32]. Monitoring algorithms for interval-based propositional or
signal logics with metric timing constraints [5, 38] are capable of interval-based
interpretations; although inapplicable to a graph-based first-order setting, they
are therefore based on interval computations which are similar to ours. Havelund
et al. present a monitoring approach for a logic defined over intervals; properties
in the logic refer to interval relations, e.g., requiring that two intervals overlap,
where the intervals may contain data [29]. The logic supports quantification over
intervals but does not support quantification over the data.


7 Conclusion and Future Work 


We present a formal and systematic treatment of incomplete traces in query-based 
runtime monitoring of temporal properties over structural runtime models. First, 
we introduce a new semantics for a first-order temporal graph logic, called definite, 
which only returns decisions if no future change to the model will affect them. 
Then, based on the definite semantics, we introduce a new definite answer set 
for the query language of INTEMPO, a querying scheme we previously presented. 
Lastly, we present the effective answer set which, contrary to the definite answer 
set, is amenable to change-driven evaluation. This answer set allows approaches 
like INTEMPO to maintain their efficiency while returning definite answers. 

Our plans for future work include a consideration of a rewriting procedure 
for properties in MTGL, such that the rewritten properties avoid or minimize 
possible delays in returning results, while allowing for a comparable performance 
to the property before rewriting. We plan to extend the API of the INTEMPO 
implementation with the option to return the effective answer set directly. More- 
over, we plan to implement the definite answer set and investigate its impact on 
performance. Although not as efficient as the effective answer set, we also plan 
to use the definite answer set for testing the answers in the effective answer set. 
Finally, we plan to extend INTEMPO with a decision procedure that, depending 
on the property, switches to the answer set that is more appropriate. 


A Overview of Notation 


The overview is shown in Table 2. 


B Proofs 


Following are the proofs for the theorems in the paper, as presented in the 
doctoral thesis of the first author [47]. 


B.1 Theorem 1: definite relations imply satisfaction relation over 
trace 


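Following is the proof for Theorem 1 (see [47, Section A.3.2]), i.e., given an
MTGC $\psi$ over a pattern $n$ and an RTM$^H$-trace $h^H$ with $D \in \mathbb{N}^{+}$ the last index,
for all $i \in [1, D] \cap \mathbb{N}^{+}$, if $m$ is a match for $n$ in $H_{[\tau_i]}$ and $\tau \in [0, \tau_i]$, then for
all $k \in [i, D] \cap \mathbb{N}^{+}$, (i) if $(H_{[\tau_i]}, m, \tau) \models^{d} \psi$, then $(H_{[\tau_k]}, m, \tau) \models \psi$, and (ii) if
$(H_{[\tau_i]}, m, \tau) \models^{d}_{F} \psi$, then $(H_{[\tau_k]}, m, \tau) \not\models \psi$.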


Proof. By definition of the RTM$^H$, a match $m$ in $H_{[\tau_i]}$ will be structurally present
in all $H_{[\tau_k]}$ with $k \in [i, D] \cap \mathbb{N}^{+}$. What may change (once) in future versions of
$H_{[\tau_i]}$ is the lifespan of $m$, i.e., if the dts of all matched elements is $\infty$ and one of
these elements is updated to a value less than $\infty$; even then, this change will not
affect the lifespan of $m$ in the period $[0, \tau_i]$, that is, in $H_{[\tau_k]}$, the observation on
whether $m$ is present in $\lambda^m \cap [0, \tau_i]$ will never be refuted.

The proof proceeds by mutual structural induction over $\psi$. In the base case,
we show the theorem to be true for the MTGL operator true. We omit the
straightforward step for conjunction.


- Base case: true.
We begin with the definite satisfaction. We assume $(H_{[\tau_i]}, m, \tau) \models^{d} \mathit{true}$
and show that $(H_{[\tau_k]}, m, \tau) \models \mathit{true}$ for an arbitrary $k \in [i, D] \cap \mathbb{N}^{+}$. By the
semantics of MTGL, true is always satisfied. Therefore, $m$ in $H_{[\tau_k]}$ also satisfies
true at $\tau$. We have shown that the implication is true.
We proceed with the definite falsification. Based on the semantics of the definite
falsification relation, a match $m$ never falsifies true. Therefore, the antecedent
$(H_{[\tau_i]}, m, \tau) \models^{d}_{F} \mathit{true}$ is false, and the implication holds vacuously.
- Induction step: $\psi = \neg\chi$.
We begin with the definite satisfaction. Assume that $(H_{[\tau_i]}, m, \tau) \models^{d}_{F} \chi \Rightarrow
(H_{[\tau_k]}, m, \tau) \not\models \chi$ for an arbitrary $k \in [i, D] \cap \mathbb{N}^{+}$. By the semantics of negation
and the definite relations, $(H_{[\tau_i]}, m, \tau) \models^{d}_{F} \chi \Leftrightarrow (H_{[\tau_i]}, m, \tau) \models^{d} \neg\chi$.
Similarly, $(H_{[\tau_k]}, m, \tau) \not\models \chi \Leftrightarrow (H_{[\tau_k]}, m, \tau) \models \neg\chi$. Therefore, it also holds that
$(H_{[\tau_i]}, m, \tau) \models^{d} \neg\chi \Rightarrow (H_{[\tau_k]}, m, \tau) \models \neg\chi$.
We proceed with the definite falsification. Assume that $(H_{[\tau_i]}, m, \tau) \models^{d} \chi \Rightarrow
(H_{[\tau_k]}, m, \tau) \models \chi$. Analogously to the definite satisfaction, $(H_{[\tau_i]}, m, \tau) \models^{d} \chi \Leftrightarrow
(H_{[\tau_i]}, m, \tau) \models^{d}_{F} \neg\chi$ and $(H_{[\tau_k]}, m, \tau) \models \chi \Leftrightarrow (H_{[\tau_k]}, m, \tau) \not\models \neg\chi$. Therefore,
$(H_{[\tau_i]}, m, \tau) \models^{d}_{F} \neg\chi \Rightarrow (H_{[\tau_k]}, m, \tau) \not\models \neg\chi$.

Table 2: Main symbols, their denoted concept, and formal representation; the
rightmost column shows the page on which the symbol was first defined.

Symbol | Concept | Formal Representation | Def.
$\psi_P$ | temporal property from running example | - | p. 3
$G_{\tau}$ | runtime model, at time point $\tau$ | typed attributed graph | p. 4
$\tau$ | time point | real number | p. 5
$h_{\tau}$ | event trace, spanning the interval $[0, \tau]$ | sequence of events | p. 5
$i$ | index of sequence member | natural number | p. 5
$\tau_i$ | time point at $i$-th member of sequence | real number | p. 5
é | mapping from events to graph modifications | function | p. 6
$H_{[\tau]}$ | runtime model with history, spanning the interval $[0, \tau]$ | typed attributed graph | p. 6
$h^H_{\tau}$ | RTM$^H$-trace, spanning the interval $[0, \tau]$ | sequence of runtime models with history | p. 6
$\psi, \chi, \omega$ | temporal property | metric temporal graph condition | p. 6
$n, \hat{n}$ | (graph) pattern | typed attributed graph | p. 6
$\models$ | (regular) satisfaction relation of metric temporal graph logic | relation | p. 7
$m, \hat{m}$ | match | morphism | p. 7
$\mathcal{L}$ | query language of INTEMPO | set of queries | p. 7
$E$ | set of matched vertices | set of vertices in given match | p. 7
$e$ | matched vertex | vertex in $E$ | p. 7
$\lambda^m$ | lifespan of a match $m$ | interval | p. 8
$\mathcal{Y}$ | satisfaction span | interval set | p. 8
$\mathcal{Z}$ | satisfaction computation | interval set | p. 8
$\mathcal{V}$ | temporal validity | interval set | p. 8
$\hat{M}$ | set of matches of $\hat{n}$ compatible to $m$ | set of matches | p. 8
$\mathcal{T}$ | (regular) answer set of $\mathcal{L}$ | set of $(m, \mathcal{V})$ pairs | p. 9
$\models^{d}$ | definite satisfaction relation | relation | p. 10
$\models^{d}_{F}$ | definite falsification relation | relation | p. 10
c | current time point | real number | p. 10
$D$ | last member of sequence | natural number | p. 11
$w$ | non-definiteness window | interval | p. 11
$\models_{T}$ | timeliest satisfaction relation | relation | p. 12
$\models^{F}_{T}$ | timeliest falsification relation | relation | p. 12
$\mathcal{Y}^d$ | definite satisfaction span | interval set | p. 13
$\mathcal{Z}^d$ | definite satisfaction computation | interval set | p. 13
$\mathcal{F}^d$ | definite falsification span | interval set | p. 13
$F^d$ | definite falsification computation | interval set | p. 13
$\mathcal{X}$ | unknown span | interval set | p. 13
$\mathcal{V}^d$ | definite temporal validity | interval set | p. 15
$\mathcal{IV}$ | temporal invalidity | interval set | p. 15
$\mathcal{IV}^d$ | definite temporal invalidity | interval set | p. 15
$\mathcal{T}^d$ | definite answer set of $\mathcal{L}$ | set of $(m, \mathcal{V}^d, \mathcal{IV}^d)$ triples | p. 15
$N$ | network | generalized discrimination network | p. 16
$\mathcal{T}^d_{\mathcal{V}}$ | restricted temporal validity answer set of $\mathcal{L}$ | subset of $\mathcal{T}^d$ only with $\mathcal{V}^d$ | p. 17
$\mathcal{T}^e$ | effective answer set of $\mathcal{L}$ | subset of $\mathcal{T}$ with $\mathcal{V}$ capped based on $w$ | p. 17

- Induction step: $\psi = \exists(\hat{n}, \chi)$.
Let the induction hypothesis be $(H_{[\tau_i]}, \hat{m}, \tau) \models^{d} \chi \Rightarrow (H_{[\tau_k]}, \hat{m}, \tau) \models \chi$ and
$(H_{[\tau_i]}, \hat{m}, \tau) \models^{d}_{F} \chi \Rightarrow (H_{[\tau_k]}, \hat{m}, \tau) \not\models \chi$, where $\hat{m}$ is a match for the pattern $\hat{n}$
and $k$ an arbitrary index in $[i, D] \cap \mathbb{N}^{+}$.
We begin with the definite satisfaction. We assume $(H_{[\tau_i]}, m, \tau) \models^{d} \exists(\hat{n}, \chi)$ and
show this implies $(H_{[\tau_k]}, m, \tau) \models \exists(\hat{n}, \chi)$. Since $(H_{[\tau_i]}, m, \tau) \models^{d} \exists(\hat{n}, \chi)$, there
exist matches $m$ and $\hat{m}$ such that $\hat{m}$ is compatible with $m$ and $\tau \in \lambda^{m} \cap \lambda^{\hat{m}}$.
The matches $m$, $\hat{m}$ will be structurally present and $\hat{m}$ will be compatible
with $m$ in all future versions of $H_{[\tau_i]}$. Moreover, there will be no changes in
$\lambda^{m}$, $\lambda^{\hat{m}}$ for the period $[0, \tau_i]$. Also, by the induction hypothesis, $\hat{m}$ satisfies
$\chi$ at $\tau$. Therefore, by the semantics of the satisfaction relation for exists,
$(H_{[\tau_k]}, m, \tau) \models \exists(\hat{n}, \chi)$. We have shown that the implication is true.
We proceed with the definite falsification. We assume that $(H_{[\tau_i]}, m, \tau) \models^{d}_{F}
\exists(\hat{n}, \chi)$ and show that this implies $(H_{[\tau_k]}, m, \tau) \not\models \exists(\hat{n}, \chi)$. Since $(H_{[\tau_i]}, m, \tau)
\models^{d}_{F} \exists(\hat{n}, \chi)$, (i) either there exists no $\hat{m}$ in $H_{[\tau_i]}$ such that $\hat{m}$ is compatible
with $m$, or (ii) there exists $\hat{m}$ compatible with $m$, but $\tau \notin \lambda^{m} \cap \lambda^{\hat{m}}$, or (iii)
there exists $\hat{m}$ compatible with $m$ with $\tau \in \lambda^{m} \cap \lambda^{\hat{m}}$ but $\hat{m}$ definitely falsifies
$\chi$ at $\tau$. If (i) is true, it will be true in all future versions of $H_{[\tau_i]}$, as matches
cannot be found retrospectively. If (ii) is true, the lifespans of the matches in the period
$[0, \tau_i]$ will not change in any future version of $H_{[\tau_i]}$. Finally, if (iii) is true, we
know from the induction hypothesis that $\hat{m}$ does not satisfy $\chi$ at $\tau$ over $H_{[\tau_k]}$ either. Therefore,
in any case, $(H_{[\tau_k]}, m, \tau) \not\models \exists(\hat{n}, \chi)$. We have shown that the implication is
true.
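- Induction step: $\psi = \chi\, U_I\, \omega$.
We begin with the definite satisfaction. Induction hypothesis: $(H_{[\tau_i]}, m, \tau)
\models^{d} \chi \Rightarrow (H_{[\tau_k]}, m, \tau) \models \chi$ and $(H_{[\tau_i]}, m, \tau) \models^{d} \omega \Rightarrow (H_{[\tau_k]}, m, \tau) \models \omega$ with $k$
an arbitrary index in $[i, D] \cap \mathbb{N}^{+}$.
We assume $(H_{[\tau_i]}, m, \tau) \models^{d} \chi\, U_I\, \omega$ and show this implies $(H_{[\tau_k]}, m, \tau) \models
\chi\, U_I\, \omega$. Since $(H_{[\tau_i]}, m, \tau) \models^{d} \chi\, U_I\, \omega$, there exists $\tau'$ such that $\tau' - \tau \in I$ and
$(H_{[\tau_i]}, m, \tau') \models^{d} \omega$, and for all $\tau'' \in [\tau, \tau')$, $(H_{[\tau_i]}, m, \tau'') \models^{d} \chi$. The decisions
for the time point $\tau'$ and for all time points $\tau''$ either concern a match or not: if
they do concern a match, then they are confined to $[0, \tau_i]$ and remain unaltered
throughout the trace; if they do not concern a match, e.g., they concern true
or $\neg$true, then they again remain unaltered. Therefore, also over $H_{[\tau_k]}$ it will
hold that $(H_{[\tau_k]}, m, \tau') \models \omega$ and, for every $\tau''$, $(H_{[\tau_k]}, m, \tau'') \models \chi$. Thus,
by the semantics of the satisfaction relation for until, $(H_{[\tau_k]}, m, \tau) \models \chi\, U_I\, \omega$.
We have shown that the implication is true.
We proceed with the definite falsification. Let the induction hypothesis be
$(H_{[\tau_i]}, m, \tau) \models^{d}_{F} \chi \Rightarrow (H_{[\tau_k]}, m, \tau) \not\models \chi$ and $(H_{[\tau_i]}, m, \tau) \models^{d}_{F} \omega \Rightarrow (H_{[\tau_k]}, m, \tau)
\not\models \omega$.
We assume $(H_{[\tau_i]}, m, \tau) \models^{d}_{F} \chi\, U_I\, \omega$ and show that this implies $(H_{[\tau_k]}, m, \tau)
\not\models \chi\, U_I\, \omega$. Since $(H_{[\tau_i]}, m, \tau) \models^{d}_{F} \chi\, U_I\, \omega$, for all $\tau'$ such that $\tau' - \tau \in I$, either (i)
$(H_{[\tau_i]}, m, \tau') \models^{d}_{F} \omega$ or (ii) there exists $\tau'' \in [\tau, \tau')$ such that $(H_{[\tau_i]}, m, \tau'') \models^{d}_{F} \chi$.
Regardless of which is the case, i.e., (i) or (ii) or both, analogously to the
definite satisfaction, if the decisions for all $\tau'$ and at $\tau''$ concern a match,
they will remain unaltered, and so will they if they do not concern a match.
Therefore, the case will also hold over $H_{[\tau_k]}$. Therefore, $(H_{[\tau_k]}, m, \tau) \not\models \chi\, U_I\, \omega$.
We have shown that the implication is true.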
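- Induction step: $\psi = \chi\, S_I\, \omega$.
The proof proceeds analogously to until. We begin with the definite satisfaction.
Let the induction hypothesis be $(H_{[\tau_i]}, m, \tau) \models^{d} \chi \Rightarrow (H_{[\tau_k]}, m, \tau) \models \chi$ and
$(H_{[\tau_i]}, m, \tau) \models^{d} \omega \Rightarrow (H_{[\tau_k]}, m, \tau) \models \omega$ with $k$ an arbitrary index in $[i, D] \cap \mathbb{N}^{+}$.
We assume $(H_{[\tau_i]}, m, \tau) \models^{d} \chi\, S_I\, \omega$ and show this implies $(H_{[\tau_k]}, m, \tau) \models
\chi\, S_I\, \omega$. Since $(H_{[\tau_i]}, m, \tau) \models^{d} \chi\, S_I\, \omega$, there exists $\tau'$ such that $\tau - \tau' \in I$
and $(H_{[\tau_i]}, m, \tau') \models^{d} \omega$, and for all $\tau'' \in (\tau', \tau]$, $(H_{[\tau_i]}, m, \tau'') \models^{d} \chi$. The
decisions for the time point $\tau'$ and all time points $\tau''$ either concern a match or
not: if they do concern a match, then they are confined to $[0, \tau_i]$ and remain
unaltered throughout the trace; if they do not concern a match, then they
will again remain unaltered. Therefore, also over $H_{[\tau_k]}$ it will hold that
$(H_{[\tau_k]}, m, \tau') \models \omega$ and, for all $\tau''$, $(H_{[\tau_k]}, m, \tau'') \models \chi$. Thus, by the semantics of
the satisfaction relation for since, $(H_{[\tau_k]}, m, \tau) \models \chi\, S_I\, \omega$. We have shown that
the implication is true.
We proceed with the definite falsification. Let the induction hypothesis be
$(H_{[\tau_i]}, m, \tau) \models^{d}_{F} \chi \Rightarrow (H_{[\tau_k]}, m, \tau) \not\models \chi$ and $(H_{[\tau_i]}, m, \tau) \models^{d}_{F} \omega \Rightarrow (H_{[\tau_k]}, m, \tau)
\not\models \omega$.
We assume $(H_{[\tau_i]}, m, \tau) \models^{d}_{F} \chi\, S_I\, \omega$ and show that this implies $(H_{[\tau_k]}, m, \tau)
\not\models \chi\, S_I\, \omega$. Since $(H_{[\tau_i]}, m, \tau) \models^{d}_{F} \chi\, S_I\, \omega$, for all $\tau'$ such that $\tau - \tau' \in I$, either (i)
$(H_{[\tau_i]}, m, \tau') \models^{d}_{F} \omega$ or (ii) there exists $\tau'' \in (\tau', \tau]$ such that $(H_{[\tau_i]}, m, \tau'') \models^{d}_{F} \chi$.
Regardless of which is the case, i.e., (i) or (ii) or both, analogously to the
definite satisfaction, if the decisions for all $\tau'$ and at $\tau''$ concern a match,
they will remain unaltered, and so will they if they do not concern a match.
Therefore, the case will also hold over $H_{[\tau_k]}$. Therefore, $(H_{[\tau_k]}, m, \tau) \not\models \chi\, S_I\, \omega$.
We have shown that the implication is true.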


From the base case and induction steps, it follows that Theorem 1 holds. 


B.2 Theorem 2: definite relations are equivalent to satisfaction 
relation over certain period of trace 


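Following is the proof for Theorem 2 (see [47, Section A.3.3]), that is, given
an MTGC $\psi$ over a pattern $n$, the non-definiteness window $w$ of $\psi$, and a
sequence of RTM$^H$ instances $h^H$ with $D \in \mathbb{N}^{+}$ the last index, for all $i \in
[1, D] \cap \mathbb{N}^{+}$, if $m$ is a match for $n$ in $H_{[\tau_i]}$ and $\tau \in [0, \tau_i - w]$, then for all $k \in
[i, D] \cap \mathbb{N}^{+}$, (i) $(H_{[\tau_i]}, m, \tau) \models^{d} \psi$ iff $(H_{[\tau_k]}, m, \tau) \models \psi$, and (ii) $(H_{[\tau_i]}, m, \tau) \models^{d}_{F} \psi$
iff $(H_{[\tau_k]}, m, \tau) \not\models \psi$.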

By definition of the RTM$^H$, a match $m$ in $H_{[\tau_i]}$ will be structurally present
in all $H_{[\tau_k]}$ with $k \in [i, D] \cap \mathbb{N}^{+}$. What may change (once) in future versions of
$H_{[\tau_i]}$ is the lifespan of $m$, i.e., if the dts of all matched elements is $\infty$ and one of
these elements is updated to a value less than $\infty$; even then, this change will not
affect the lifespan of $m$ in the period $[0, \tau_i]$, that is, in $H_{[\tau_k]}$, the observation on
whether $m$ is present in $\lambda^{m} \cap [0, \tau_i]$ will never be refuted.

Proof. The direction $\Rightarrow$ of the equivalence has been shown by the more general
Theorem 1, which concerned an arbitrary $\tau$. We therefore focus on direction
$\Leftarrow$ of the equivalence. As $m$ is present in $H_{[\tau_i]}$, its lifespan $\lambda^{m}$ in the period
$[0, \tau_i]$ will remain unchanged in subsequent versions of $H_{[\tau_i]}$. In the following, the
non-definiteness window $w$ is computed according to Definition 7.


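The proof proceeds by mutual structural induction over $\psi$. In the base case,
we show the theorem to be true for the MTGL operator true. We omit the
straightforward step for conjunction.

- Base case: true.
We begin with the satisfaction. We assume $(H_{[\tau_k]}, m, \tau) \models \mathit{true}$ for an arbitrary
$k \in [i, D] \cap \mathbb{N}^{+}$ and $\tau \in [0, \tau_i - w]$ with $w(\mathit{true}) = 0$, and show that this implies
$(H_{[\tau_i]}, m, \tau) \models^{d} \mathit{true}$. As true is always satisfied, $m$ in $H_{[\tau_i]}$ definitely satisfies
true at $\tau$. Hence, the implication is true.
We proceed with the falsification. Based on the semantics of satisfaction, a
match $m$ never falsifies true. Therefore, the antecedent $(H_{[\tau_k]}, m, \tau) \not\models \mathit{true}$
is false, and the implication holds vacuously.

- Induction step: $\psi = \neg\chi$.
We begin with the satisfaction. Let $(H_{[\tau_k]}, m, \tau) \not\models \chi \Rightarrow (H_{[\tau_i]}, m, \tau) \models^{d}_{F} \chi$
for an arbitrary $k \in [i, D] \cap \mathbb{N}^{+}$ and $\tau \in [0, \tau_i - w]$ with $w(\neg\chi) = w(\chi)$.
By the semantics of negation and the satisfaction relation, $(H_{[\tau_k]}, m, \tau) \not\models
\chi \Leftrightarrow (H_{[\tau_k]}, m, \tau) \models \neg\chi$. Similarly, $(H_{[\tau_i]}, m, \tau) \models^{d}_{F} \chi \Leftrightarrow (H_{[\tau_i]}, m, \tau) \models^{d} \neg\chi$.
Therefore, it also holds that $(H_{[\tau_k]}, m, \tau) \models \neg\chi \Rightarrow (H_{[\tau_i]}, m, \tau) \models^{d} \neg\chi$.
We proceed with the falsification. Assume $(H_{[\tau_k]}, m, \tau) \models \chi \Rightarrow (H_{[\tau_i]}, m, \tau)
\models^{d} \chi$. Analogously to the satisfaction, $(H_{[\tau_k]}, m, \tau) \models \chi \Leftrightarrow (H_{[\tau_k]}, m, \tau) \not\models \neg\chi$
and $(H_{[\tau_i]}, m, \tau) \models^{d} \chi \Leftrightarrow (H_{[\tau_i]}, m, \tau) \models^{d}_{F} \neg\chi$. Therefore, $(H_{[\tau_k]}, m, \tau) \not\models
\neg\chi \Rightarrow (H_{[\tau_i]}, m, \tau) \models^{d}_{F} \neg\chi$.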
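- Induction step: $\psi = \exists(\hat{n}, \chi)$.
Let the induction hypothesis be $(H_{[\tau_k]}, \hat{m}, \tau) \models \chi \Rightarrow (H_{[\tau_i]}, \hat{m}, \tau) \models^{d} \chi$ and
$(H_{[\tau_k]}, \hat{m}, \tau) \not\models \chi \Rightarrow (H_{[\tau_i]}, \hat{m}, \tau) \models^{d}_{F} \chi$, where $\hat{m}$ is a match for the pattern $\hat{n}$,
$k$ an arbitrary index in $[i, D] \cap \mathbb{N}^{+}$, and $\tau \in [0, \tau_i - w]$. The non-definiteness
window $w$ is given by $w(\exists(\hat{n}, \chi)) = w(\chi)$.
We begin with the satisfaction. We assume that $(H_{[\tau_k]}, m, \tau) \models \exists(\hat{n}, \chi)$ and
show that this implies $(H_{[\tau_i]}, m, \tau) \models^{d} \exists(\hat{n}, \chi)$. Since $(H_{[\tau_k]}, m, \tau) \models \exists(\hat{n}, \chi)$,
there exist matches $m$ and $\hat{m}$ in $H_{[\tau_k]}$ such that $\hat{m}$ is compatible with $m$
and $\tau \in \lambda^{m} \cap \lambda^{\hat{m}}$. The match $m$ is present in $H_{[\tau_i]}$ and, according to the
induction hypothesis, the match $\hat{m}$ is also present in $H_{[\tau_i]}$. As the matches
are structurally the same, $\hat{m}$ is also compatible with $m$ in $H_{[\tau_i]}$. Moreover, as
there are no changes in $\lambda^{m}$, $\lambda^{\hat{m}}$ for the period $[0, \tau_i]$, $\tau \in \lambda^{m} \cap \lambda^{\hat{m}}$ over $H_{[\tau_i]}$.
We also know that $\tau \leq \tau_i$ and, by the induction hypothesis, that $\hat{m}$ definitely satisfies $\chi$
at $\tau$. Therefore, by the semantics of the definite satisfaction relation for exists,
$(H_{[\tau_i]}, m, \tau) \models^{d} \exists(\hat{n}, \chi)$. We have shown that the implication is true.
We proceed with the falsification. We assume that $(H_{[\tau_k]}, m, \tau) \not\models \exists(\hat{n}, \chi)$ and
show that this implies $(H_{[\tau_i]}, m, \tau) \models^{d}_{F} \exists(\hat{n}, \chi)$. Since $(H_{[\tau_k]}, m, \tau) \not\models \exists(\hat{n}, \chi)$,
(i) either there exists no $\hat{m}$ in $H_{[\tau_k]}$ such that $\hat{m}$ is compatible with $m$, or (ii)
there exists $\hat{m}$ compatible with $m$, but $\tau \notin \lambda^{m} \cap \lambda^{\hat{m}}$, or (iii) there exists $\hat{m}$
compatible with $m$ with $\tau \in \lambda^{m} \cap \lambda^{\hat{m}}$ but $\hat{m}$ falsifies $\chi$ at $\tau$. If (i) is true,
it will be true in all future versions of $H_{[\tau_i]}$, as matches cannot be found
retrospectively. If (ii) is true, the lifespans of the matches in the period $[0, \tau_i]$ will not
change in any future version of $H_{[\tau_i]}$. Finally, if (iii) is true, we know from
the induction hypothesis that $\hat{m}$ definitely falsifies $\chi$ at $\tau$ also over $H_{[\tau_i]}$ and that $\tau \leq \tau_i$.
Therefore, in any case, $(H_{[\tau_i]}, m, \tau) \models^{d}_{F} \exists(\hat{n}, \chi)$. We have shown that the
implication is true.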

Induction step: $\psi = \chi U_I \omega$.

We begin with the satisfaction. Let the induction hypothesis be $(H_{[\tau_i]}, m, \tau) \models \chi \Rightarrow (H_{[\tau_k]}, m, \tau) \models^{d} \chi$
and $(H_{[\tau_i]}, m, \tau) \models \omega \Rightarrow (H_{[\tau_k]}, m, \tau) \models^{d} \omega$ with $k$ an
arbitrary index in $[i, D] \cap \mathbb{N}^{+}$ and $\tau \in [0, \tau_i - w]$. The non-definiteness window
$w$ is given by $\max(w(\chi), w(\omega)) + r(I)$.

We assume $(H_{[\tau_i]}, m, \tau) \models \chi U_I \omega$ and show $(H_{[\tau_k]}, m, \tau) \models^{d} \chi U_I \omega$. Since
$(H_{[\tau_i]}, m, \tau) \models \chi U_I \omega$, there exists $\tau'$ such that $\tau' - \tau \in I$ and $(H_{[\tau_i]}, m, \tau') \models \omega$,
and for all $\tau'' \in [\tau, \tau')$, $(H_{[\tau_i]}, m, \tau'') \models \chi$. From $\tau \in [0, \tau_i - w]$ and
$\tau' \in [\tau + \ell(I), \tau + r(I)]$, it follows that $\tau' \le \tau_i - \max(w(\chi), w(\omega))$. Based on
this and the induction hypothesis, $(H_{[\tau_k]}, m, \tau') \models^{d} \omega$. Moreover, as $\tau'$ stems
from a period outside the non-definiteness window of $\omega$, the decision at $\tau'$,
whether it concerns a match or not, will remain unaltered once made.

The decision at $\tau'$ as well as the preceding period $[\tau, \tau')$ are also outside
the non-definiteness window of $\chi$. Thus, all $\tau'' \in [\tau, \tau')$ stem from a period
covered by $H_{[\tau_i]}$, and decisions for $\chi$ made in this period are definite. Therefore,
for all $\tau'' \in [\tau, \tau')$, $(H_{[\tau_k]}, m, \tau'') \models^{d} \chi$, and, by the definite semantics,
$(H_{[\tau_k]}, m, \tau) \models^{d} \chi U_I \omega$. We have shown that the implication is true.

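We proceed with the falsification. Let the induction hypothesis be that
$(H_{[\tau_i]}, m, \tau) \not\models \chi \Rightarrow (H_{[\tau_k]}, m, \tau) \models^{d}_{F} \chi$ and $(H_{[\tau_i]}, m, \tau) \not\models \omega \Rightarrow (H_{[\tau_k]}, m, \tau) \models^{d}_{F} \omega$.
We assume $(H_{[\tau_i]}, m, \tau) \not\models \chi U_I \omega$ and show $(H_{[\tau_k]}, m, \tau) \models^{d}_{F} \chi U_I \omega$. Since
$(H_{[\tau_i]}, m, \tau) \not\models \chi U_I \omega$, it holds that for all $\tau'$ such that $\tau' - \tau \in I$ either (i)
$(H_{[\tau_i]}, m, \tau') \not\models \omega$ or (ii) there exists $\tau'' \in [\tau, \tau')$ such that $(H_{[\tau_i]}, m, \tau'') \not\models \chi$.
Regardless of which is the case, i.e., (i) or (ii) or both, analogously to the
satisfaction, the decisions for all $\tau'$ and at $\tau''$ stem from a period that is covered
by $H_{[\tau_i]}$, and decisions made in this period regarding $\chi$ and $\omega$ are definite.
Therefore, the case will also hold over $H_{[\tau_k]}$. Therefore, $(H_{[\tau_k]}, m, \tau) \models^{d}_{F} \chi U_I \omega$.
We have shown that the implication is true.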

Induction step: $\psi = \chi S_I \omega$.

We begin with the satisfaction. Let the induction hypothesis be $(H_{[\tau_i]}, m, \tau) \models \chi \Rightarrow (H_{[\tau_k]}, m, \tau) \models^{d} \chi$
and $(H_{[\tau_i]}, m, \tau) \models \omega \Rightarrow (H_{[\tau_k]}, m, \tau) \models^{d} \omega$ with $k$ an
arbitrary index in $[i, D] \cap \mathbb{N}^{+}$ and $\tau \in [0, \tau_i - w]$. The non-definiteness window
$w$ is given by $\max(w(\chi), w(\omega))$.

We assume $(H_{[\tau_i]}, m, \tau) \models \chi S_I \omega$ and show $(H_{[\tau_k]}, m, \tau) \models^{d} \chi S_I \omega$. Since
$(H_{[\tau_i]}, m, \tau) \models \chi S_I \omega$, there exists $\tau'$ such that $\tau - \tau' \in I$ and $(H_{[\tau_i]}, m, \tau') \models \omega$,
and for all $\tau'' \in (\tau', \tau]$, $(H_{[\tau_i]}, m, \tau'') \models \chi$. From $\tau \in [0, \tau_i - w]$ and
$\tau' \in [\tau - r(I), \tau - \ell(I)]$, it follows that $\tau' \le \tau_i - \max(w(\chi), w(\omega))$. Hence, the
decision at $\tau'$ can already be made over $H_{[\tau_i]}$, and, moreover, as $\tau'$ stems from
a period outside the non-definiteness window of $\omega$, the decision at $\tau'$, whether
it concerns a match or not, will remain unaltered once made. Therefore,
$(H_{[\tau_k]}, m, \tau') \models^{d} \omega$. The decision at $\tau'$ as well as the succeeding period $(\tau', \tau]$
are also outside the non-definiteness window of $\chi$. Thus, all $\tau'' \in (\tau', \tau]$ stem
from a period covered by $H_{[\tau_i]}$, and decisions for $\chi$ made in this period are
definite. Therefore, for all $\tau'' \in (\tau', \tau]$, $(H_{[\tau_k]}, m, \tau'') \models^{d} \chi$, and, by the definite
semantics, $(H_{[\tau_k]}, m, \tau) \models^{d} \chi S_I \omega$. We have shown that the implication is true.
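We proceed with the falsification. Let the induction hypothesis be that
$(H_{[\tau_i]}, m, \tau) \not\models \chi \Rightarrow (H_{[\tau_k]}, m, \tau) \models^{d}_{F} \chi$ and $(H_{[\tau_i]}, m, \tau) \not\models \omega \Rightarrow (H_{[\tau_k]}, m, \tau) \models^{d}_{F} \omega$.
We assume $(H_{[\tau_i]}, m, \tau) \not\models \chi S_I \omega$ and show $(H_{[\tau_k]}, m, \tau) \models^{d}_{F} \chi S_I \omega$. Since
$(H_{[\tau_i]}, m, \tau) \not\models \chi S_I \omega$, it holds that for all $\tau'$ such that $\tau - \tau' \in I$ either (i)
$(H_{[\tau_i]}, m, \tau') \not\models \omega$ or (ii) there exists $\tau'' \in (\tau', \tau]$ such that $(H_{[\tau_i]}, m, \tau'') \not\models \chi$.
Regardless of which is the case, i.e., (i) or (ii) or both, analogously to the
satisfaction, the decisions for all $\tau'$ and at $\tau''$ stem from a period that is covered
by $H_{[\tau_i]}$, and decisions made in this period regarding $\chi$ and $\omega$ are definite.
Therefore, the case will also hold over $H_{[\tau_k]}$. Therefore, $(H_{[\tau_k]}, m, \tau) \models^{d}_{F} \chi S_I \omega$.
We have shown that the implication is true.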


From the base case and induction steps, it follows that Theorem 2 holds. 


B.3 Corollary 1: Period in trace with non-definite decisions 


Following is the proof for Corollary 1 (see [47, p. 32]), that is, if $\psi$ is an MTGC,
$w$ is the non-definiteness window of $\psi$, $H_{[\tau_i]}$ is a RTM$^{H}$ instance associated with
the time point $\tau_i$, $m$ is a match for a pattern $n$, and $\tau$ a time point in $[0, \tau_i]$, then
if $(H_{[\tau_i]}, m, \tau) \not\models^{d} \psi$ and $(H_{[\tau_i]}, m, \tau) \not\models^{d}_{F} \psi$, then $\tau \in (\tau_i - w, \tau_i]$.


Proof. The proof follows from Theorem 2. The satisfaction relation and its
negation make a decision for every time point in $[0, \tau_i - w]$, i.e., the relation
does not support the value unknown; Theorem 2 shows that the decisions made
by the satisfaction relation and its negation for $[0, \tau_i - w]$ are equivalent to the
decisions made by the definite relations. Consequently, if no definite decision is
made for $\tau \in [0, \tau_i]$, then $\tau \notin [0, \tau_i - w]$.


B.4 Theorem 3: Equality of definite spans and definite computations 
for satisfaction and falsification 


Following is the proof for Theorem 3 (see [47, Section A.3.4]), i.e., given a match
$m$ over a RTM$^{H}$ $H_{[\tau]}$ and an MTGC $\psi$, the definite satisfaction span $Y^{d}$ of $m$
for $\psi$ over $H_{[\tau]}$ is given by the definite satisfaction computation $Z^{d}$ of $m$ for $\psi$
over $H_{[\tau]}$ in Definition 8, that is, $Y^{d}(m, \psi) = Z^{d}(m, \psi)$. Moreover, the definite
falsification span F of m for ~ over Hy is given by the definite falsification 
computation F of m for y over Hy in Definition 8, that is, F(m, Y) = F(m, 7). 
Proof. The proof for the definite satisfaction span $Z^{d}$ proceeds almost identically
to the proof for Theorem 1 for $Z$ in [47, Section A.3.1], i.e., by structural induction
over $\psi$, and is therefore omitted. For true, conjunction, exists, until, and since in
Definition 8, inclusion can be shown in both directions; the proof for the negation
relies on a reasoning analogous to the one presented below for negation for the
definite falsification span.

The proof for the definite falsification $F$ is based on the application of
$F = \mathbb{R} \setminus (Z^{d} \uplus X)$ for each MTGL operator, which follows from $\mathbb{R} = Y^{d} \uplus F \uplus X$.
The unknown span $X$ for true is $X = \emptyset$, whereas for exists, by definition of the
RTM$^{H}$ $H_{[\tau]}$, it is $X = (\tau, \infty)$. If $F$ is known, it can be used to compute $Z^{d} \uplus X$.


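— $\psi = true$: From Equation 8 in Definition 8, we have $Z^{d}(m, true) = \mathbb{R}$, therefore
$F(m, true) = \emptyset$.
— $\psi = \neg\chi$: It holds that

$F(m, \neg\chi) = \mathbb{R} \setminus (Z^{d}(m, \neg\chi) \uplus X(m, \neg\chi))$

and

$Z^{d}(m, \chi) = \mathbb{R} \setminus (Z^{d}(m, \neg\chi) \uplus X(m, \neg\chi))$

Therefore,

$F(m, \neg\chi) = Z^{d}(m, \chi) = Y^{d}(m, \chi)$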

— $\psi = \chi \wedge \omega$: Let each time point that does not definitely falsify the MTGC
$\chi$ be assumed to satisfy $\chi$. In practice, this includes all
time points in $Z^{d}(m, \chi) \uplus X(m, \chi)$. Subtracting this maximal satisfaction
span from the time domain $\mathbb{R}$ yields the set of time points that definitely falsify
$\chi$. Let the satisfaction span of $\omega$ be defined analogously. If the satisfaction
span of conjunction is computed based on these maximal satisfaction spans of
$\chi$ and $\omega$, i.e., by $(Z^{d}(m, \chi) \uplus X(m, \chi)) \cap (Z^{d}(m, \omega) \uplus X(m, \omega))$, the definite
falsification span of conjunction can be computed analogously.

$F(m, \chi \wedge \omega) = \mathbb{R} \setminus ((Z^{d}(m, \chi) \uplus X(m, \chi)) \cap (Z^{d}(m, \omega) \uplus X(m, \omega)))$
$= \mathbb{R} \setminus ((\mathbb{R} \setminus F(m, \chi)) \cap (\mathbb{R} \setminus F(m, \omega)))$
$= F(m, \chi) \cup F(m, \omega)$


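— $\psi = \exists(\hat{n}, \chi)$: Let $\tau$ be the time point of the RTM$^{H}$ $H_{[\tau]}$. As $Z^{d}(m, \exists(\hat{n}, \chi))$ is
known and $X(m, \exists(\hat{n}, \chi)) = (\tau, \infty)$, to obtain the falsification computation,
we can directly solve $\mathbb{R} \setminus (Z^{d} \uplus X)$.

$F(m, \exists(\hat{n}, \chi)) = \mathbb{R} \setminus (Z^{d}(m, \exists(\hat{n}, \chi)) \uplus (\tau, \infty))$
$= (\mathbb{R} \setminus (\tau, \infty)) \cap (\mathbb{R} \setminus Z^{d}(m, \exists(\hat{n}, \chi)))$
$= (-\infty, \tau] \cap (\mathbb{R} \setminus Z^{d}(m, \exists(\hat{n}, \chi)))$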
— $\psi = \chi U_I \omega$ and $0 \notin I$: The computation for until relies on the reasoning
explained in the case of conjunction. The satisfaction span of until is computed
based on the maximal satisfaction spans of $\omega$, i.e., $Z^{d}(m, \omega) \uplus X(m, \omega)$, and $\chi$,
that is, $J$ is obtained by $Z^{d}(m, \omega) \uplus X(m, \omega)$ and $Z^{d}(m, \chi) \uplus X(m, \chi)$, thus
the until satisfaction span is similarly maximal. Therefore, complementing
this maximal satisfaction span yields all time points that definitely falsify
until. Therefore, we have:


Flom xs) =R\ ( U in(grnnos)] 


1€Z4(m,w)UX (mw), jE TX 


— $\psi = \chi U_I \omega$ and $0 \in I$: The reasoning is similar to the case where $0 \notin I$.

— $\psi = \chi S_I \omega$ and $0 \notin I$: The case proceeds analogously to the corresponding case
of until.

— $\psi = \chi S_I \omega$ and $0 \in I$: The case proceeds analogously to the corresponding case
of until.


By showing that $Y^{d}(m, \psi) = Z^{d}(m, \psi)$ and the equations for $F(m, \psi)$, we have
shown that the theorem holds.


B.5 Theorem 4: Equality of effective answer set and restricted 
definite temporal validity answer set over trace 


Following is the proof for Theorem 4 (see [47, p. 57]), which states that, if
¢ := (n,w) is a temporal query with y an MTGC, w is the non-definiteness 
window of 4%, hË is a RTM"-trace with D € [2,00] N NĦ, i is an index in 
[k, D —1] ONF such that 7 > w. TS (Hiri) is the restricted definite temporal 
validity answer set over H,,) which has been obtained from the definite answer 
set J? but contains (i) only pairs of matches with their temporal validity V? with 
Vt Æ Ü and (ii) V? is intersected with [0,7; — w], then the effective answer set 
T (Hing) is equal to TS „(Hin])- 


Proof. The proof is based on the more general Theorem 2, which shows that, for $\tau \in [0, \tau_i - w]$,
the satisfaction decision for $\tau$ in $H_{[\tau_i]}$ is equivalent to the definite satisfaction decision
for $\tau$ in $H_{[\tau_i]}$. The computations of $V$ and $V^{d}$ over $H_{[\tau_i]}$ rely on the computations of
$Z$ and $Z^{d}$ over $H_{[\tau_i]}$, respectively. Theorem 1 in [47, Section A.3.1] and Theorem 3
show that the satisfaction relation and the definite satisfaction relation over $H_{[\tau_i]}$ are
soundly reflected in $Z$ and $Z^{d}$ over $H_{[\tau_i]}$, respectively.
Abstract. A business process is a collection of structured tasks corresponding
to a service or a product. Business processes do not execute
once and for all, but are executed multiple times resulting in multiple 
instances. In this context, it is particularly difficult to ensure correctness 
and efficiency of the multiple executions of a process. In this paper, we 
propose to rely on Probabilistic Model Checking (PMC) to automati- 
cally verify that multiple executions of a process respect some specific 
probabilistic property. This approach applies at runtime, thus the evalua- 
tion of the property is periodically verified and the corresponding results 
updated. However, we go beyond runtime PMC for BPMN, since we pro- 
pose runtime enforcement techniques to keep executing the process while 
avoiding the violation of the property. To do so, our approach combines 
monitoring techniques, computation of probabilistic models, PMC, and 
runtime enforcement techniques. The approach has been implemented as 
a toolchain and has been validated on several realistic BPMN processes. 


1 Introduction 


Business processes are structured tasks that model a specific service or prod- 
uct. Such processes are present in any company or institution worldwide, and 
there is a need for better controlling these processes to reduce costs and im- 
prove throughput. Many companies model their services and processes, thereby 
increasing their level of automation. One of the challenges in this context is to 
ensure the quality, correctness, and efficiency of these processes. In this paper, we 
assume that processes are described using Business Process Model and Notation 
(BPMN) [20], the standard business process modelling language. BPMN pro- 
cesses are not executed once but multiple times, resulting in multiple instances. 

In this study, we focus on quantitative analysis of processes, which is partic- 
ularly useful for computing probabilistic properties or other metrics related to 
time, costs or resource usage. More precisely, we use probabilistic model checking 
(PMC) to automatically verify that multiple executions of a process respect
probabilistic properties [15]. In the context of BPMN processes, probabilistic
properties help verify that the usage of some task does not exceed a certain threshold,
or compute how many resources have to be associated with specific tasks to
execute the process smoothly. The evaluation of a probabilistic property depends
strongly on the number of process instances being executed. Therefore, PMC should
be applied at runtime to analyse the current execution of running instances. The
property is periodically verified, and the corresponding results are updated.


In this paper, we not only verify probabilistic properties on BPMN processes
using PMC at runtime, but also ensure that process executions do not violate
the property. To do so, we rely on runtime verification and enforcement
techniques. Runtime verification is a technique for checking at runtime whether a
system's executions satisfy a given correctness property. Runtime Enforcement
(RE) is complementary to runtime verification and provides techniques
that can intervene in the system at runtime to ensure that the behaviour of the
system respects the expected properties. In this paper, the system consists of
the multiple executions of a process, and we want these executions to always
satisfy a given property. This is possible by catching the flow of executions of 
these process instances and by changing it (when the property is violated) using 
correcting actions (such as buffering or reordering specific tasks). 


More precisely, we introduce probabilistic runtime enforcement, allowing 
BPMN processes to satisfy a given probabilistic property at runtime. To achieve 
this, we first convert the BPMN process into a formal model represented by 
a Labelled Transition System (LTS). We then monitor the multiple executions 
of the process and extract the corresponding traces (one trace per process in- 
stance). Based on these execution traces, we can annotate the LTS model of 
the process by adding execution probabilities to transitions of the LTS, thus
obtaining a Probabilistic Transition System (PTS) model. Note that
recent actions are taken into account when computing this PTS, but they are not yet
actually released or considered executed. Probabilistic model checking is then used to
verify whether the PTS model satisfies the given property. If the property is 
satisfied, all recent actions are released. If the property is violated, the enforce- 
ment mechanism is triggered and the aforementioned recent actions are retained, 
removed or re-ordered to avoid the property violation. This approach was fully 
implemented and its effectiveness was validated on several examples of processes 
and properties. 


The contributions of this work can be summarised as follows: 


— A novel algorithm, which analyses (possibly incomplete) execution traces 
and builds a Probabilistic Transition System. 


— A probabilistic enforcement mechanism, which avoids probabilistic property 
violation when executing multiple process instances. 


— An entire toolchain supporting the whole approach and its validation on 
realistic processes. 


The organisation of this paper is as follows. Section 2 introduces the background
notions required for this work. Section 3 presents the probabilistic enforcement
approach for BPMN. Section 4 describes the toolchain automating
all the steps of the approach, illustrates the approach with a case study, and presents
experimental results. Section 5 surveys related work, and Section 6 concludes.
2 Background


This section outlines the fundamental concepts, such as BPMN, Labelled Tran- 
sition System (LTS), Probabilistic Transition System (PTS), execution traces, 
and probabilistic properties. 


2.1 Business Process Model and Notation 


Business Process Model and Notation (BPMN) is a widely used workflow-based
notation for describing and modelling business processes [20]. The syntax of a
BPMN process is defined as a graph-based structure, where vertices or nodes
represent various elements such as events, tasks, and gateways, and edges or
flows connect these nodes. Figure 1 introduces the key elements of the BPMN
notation.


[Figure: BPMN symbols for the initial event, end event, task, and flow, and for the split and merge gateways (inclusive, exclusive, parallel).]
Fig. 1: Excerpt from the BPMN notation.


The diagram includes the initial event and the end event, which serve to 
initialise and terminate processes, respectively. It is assumed that there is only 
one initial event, which corresponds to the initiation of a process and at least one 
end event, which corresponds to the completion of a process. A task represents an
atomic activity and typically has only one incoming flow and one outgoing flow, 
denoting the sequence of activities within the process. Gateways are used to 
describe the control flow of the process. There are two patterns for each gateway 
type: the split pattern and the merge pattern. The split pattern consists of a 
single incoming flow and multiple outgoing flows. The merge pattern consists of 
multiple incoming flows and a single outgoing flow. Several types of gateways 
are available, such as exclusive, parallel, and inclusive gateways. An exclusive 
gateway corresponds to a choice among several flows. A parallel gateway executes 
all possible flows at the same time. An inclusive gateway executes one or several 
flows. The choice of flows to execute in exclusive and inclusive gateways depends 
on the evaluation of data-based conditions. 

This paper focuses on the multiple executions of a single process, known as
process instances. Each instance is characterised by an identifier and by the list
of tasks executed by this instance. It is assumed that each instance eventually
completes, thus resulting in a finite list of tasks.


2.2 LTS & PTS 


Labelled and Probabilistic Transition Systems are used in this paper as semantic 
models for BPMN. Moreover, they allow the automated analysis of the corre- 
sponding BPMN processes. 


Definition 1 (LTS). A Labelled Transition System (LTS) is a tuple $(Q, \Sigma, q_{init}, \Delta)$,
where: $Q$ is a finite set of states, $\Sigma$ is a finite set of labels/actions, $q_{init}$ is
the initial state, and $\Delta \subseteq Q \times \Sigma \times Q$ is a transition relation, where $(q, a, q') \in \Delta$
represents a possible transition from state $q$ to state $q'$ with label $a$, also written
$q \xrightarrow{a} q'$.


Probabilities are useful for making explicit the likelihood of executing specific 
tasks in a process. Therefore, we also use Probabilistic Transition Systems [23],
an extension of the LTS model that incorporates probabilities for transitions. 


Definition 2 (PTS). A Probabilistic Transition System (PTS) is a tuple $(S, A, s_{init}, \delta, P)$
such that $(S, A, s_{init}, \delta)$ is a labelled transition system as per Definition 1
and $P : \delta \to [0, 1]$ is the probability labelling function.


$P(s \xrightarrow{a} s') \in [0, 1]$ is the probability for the system to move from state
$s$ to state $s'$, performing action $a$. For each state $s$, the sum of the probabilities
associated with its outgoing transitions is equal to 1, that is, $\forall s \in S :
\sum_{(s, a, s') \in \delta} P(s, a, s') = 1$. When using LTS or PTS as a semantic model of a BPMN
process, the set of labels or alphabet refers to the set of tasks appearing in the
BPMN process.
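
Definitions 1 and 2 can be mirrored by two small record types. The following Python sketch is only an illustration: the class names `LTS` and `PTS`, the field names, and the method `is_well_formed` are assumptions of this example and are not part of the toolchain described in the paper. The check skips states without outgoing transitions (for instance, states reached after an end event), which is also an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from math import isclose

# A transition is a triple (source state, label, target state).
Transition = tuple[str, str, str]


@dataclass
class LTS:
    states: set[str]
    labels: set[str]
    init: str
    transitions: set[Transition]


@dataclass
class PTS(LTS):
    # Probability labelling function P: transition -> [0, 1].
    prob: dict[Transition, float] = field(default_factory=dict)

    def is_well_formed(self) -> bool:
        """Check that P is defined exactly on the transitions and that the
        outgoing probabilities of every state with successors sum to 1."""
        if set(self.prob) != self.transitions:
            return False
        totals: dict[str, float] = defaultdict(float)
        for (src, _label, _dst), p in self.prob.items():
            if not 0.0 <= p <= 1.0:
                return False
            totals[src] += p
        return all(isclose(total, 1.0) for total in totals.values())
```

A PTS with two transitions leaving `s0`, labelled with probabilities 0.25 and 0.75, passes the check; changing one of the two values makes it fail.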


2.3 Execution Traces 


A process can be executed multiple times, resulting in multiple instances. Each 
process instance being executed can be in one of the following three states: wait- 
ing state, running/ongoing state, and completed state. Any (ongoing or com- 
pleted) instance consists of a sequence of tasks within the process. Every time 
an instance executes, it results in an execution trace of tasks. 


Definition 3 (Execution Trace). An execution trace ($\sigma$) refers to a sequence
of tasks that are executed in a specific order by a specific process instance.


It is worth noting that in the rest of this work, an execution trace can be 
completed or not. In the latter case, this is due to the fact that the process 
instance is still running and has not completed yet. 

Several operations can be performed on execution traces. Assuming an execution
trace $\sigma$ of length $n$ and an execution trace $\sigma'$ of length $m$, we define the
following primitive operations:

— Size: $Size(\sigma) = |\sigma|$.

— Index: $\sigma[i]$ is the $i$th element in $\sigma$, $i < n$.

— Slice: $\sigma[0 \ldots i] = \sigma[0].\sigma[1].\cdots.\sigma[i-1]$, $i \le n$.

— Concatenation: $Concat(\sigma, \sigma') = \sigma[0 \ldots n].\sigma'[0 \ldots m]$.

— Reorder: $Reorder(\sigma, \sigma') = \sigma'[0 \ldots m].\sigma[0 \ldots n]$.
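
The primitive operations above map directly onto list operations. The sketch below assumes that a trace is represented as a Python list of task names; the function names are chosen for this illustration only.

```python
Trace = list[str]  # assumption: a trace is a list of task names


def size(sigma: Trace) -> int:
    """Size: number of tasks in the trace."""
    return len(sigma)


def index(sigma: Trace, i: int) -> str:
    """Index: the element at position i, with 0 <= i < n."""
    if not 0 <= i < len(sigma):
        raise IndexError("index out of range")
    return sigma[i]


def slice_prefix(sigma: Trace, i: int) -> Trace:
    """Slice: the first i elements sigma[0] ... sigma[i-1], with 0 <= i <= n."""
    if not 0 <= i <= len(sigma):
        raise IndexError("slice bound out of range")
    return sigma[:i]


def concat(sigma: Trace, sigma2: Trace) -> Trace:
    """Concatenation: all of sigma followed by all of sigma2."""
    return sigma + sigma2


def reorder(sigma: Trace, sigma2: Trace) -> Trace:
    """Reorder: all of sigma2 followed by all of sigma."""
    return sigma2 + sigma
```

For example, with `sigma = ["A", "B", "C"]` and `sigma2 = ["D"]`, `concat` yields `["A", "B", "C", "D"]` and `reorder` yields `["D", "A", "B", "C"]`.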


2.4 Probabilistic Properties 


The Model Checking Language (MCL) is a branching-time temporal logic
that is suitable for expressing properties of concurrent systems using actions. It
extends the alternation-free $\mu$-calculus with regular expressions, data-based
constructs, and fairness operators. A probabilistic property is a specification
or requirement that expresses a probabilistic behaviour of a system or model
being analysed. In this paper, probabilistic properties are used to describe the
requirements for the probability of execution of a task or a set of combined tasks
in a BPMN process. We use MCL to describe probabilistic properties using the
prob R is op [ ? ] E end prob construct [24], where R is a regular formula that
describes transition sequences, op is a comparison operator such as “<”, “≤”, “>”,
“≥”, “=”, “<>”, and E is a real number that represents a probability. Given an
MCL probabilistic property and a PTS model, we use the CADP Probabilistic
Model Checker to evaluate the property on the PTS model.


3 Probabilistic Runtime Enforcement 


Our approach takes two inputs, a BPMN model and a probabilistic property, 
and produces as output a list of safe-to-execute tasks, in the sense that they 
do not violate the given property. This approach consists of three parts: the 
monitoring part, the transformation part, and the probabilistic runtime enforcement
mechanism (Figure 2). First, monitoring is used to observe the multiple
executions of the given process, in particular to retrieve the tasks executed by 
each process instance (resulting in execution traces). Second, the input BPMN 
model is transformed into its corresponding semantic model, namely an LTS. 
This step is performed only once. Finally, the probabilistic runtime enforcement 
mechanism consists of two modules. The first module corresponds to Probabilis- 
tic Model Checking (PMC), which determines whether a new version of the PTS 
violates the given probabilistic property. The second module corresponds to the 
enforcer, which is activated only when the probabilistic model checking returns 
false. In such a case, the enforcer applies appropriate techniques to modify the 
input trace (e.g., by retaining some tasks and not executing them immediately), 
and thus avoid property violation. 


3.1 Monitoring 


Monitoring techniques are used to observe the current status of
the BPMN process executions. More precisely, we monitor process executions
from an instance perspective, since the main goal is to extract all traces executed
by ongoing process instances during a given period.

Fig. 2: Approach Overview.

Figure 3 illustrates the monitoring process of a BPMN process at runtime,
which involves observing every generated instance for that process. Multiple 
instances can execute concurrently, and all information related to the execution 
of one process instance is stored in a database. To retrieve execution traces 
for all process instances, we rely on extraction techniques at varying levels of 
granularity. As shown in the figure, each instance execution trace is composed 
of a process ID, an instance ID, a set of tasks, a start time, and an end time. 
Fig. 3: Runtime monitoring of multiple executions of a BPMN process.


Since we focus here on long-running process executions, it does not make
sense to retrieve all execution traces from the beginning. Therefore, the extraction
is triggered for a specific time window. This operation is repeated periodically,
thus resulting in a sliding window algorithm. Algorithm 1 aims at extracting
the execution traces for all instances that are either in progress or have
already finished during a specified time window. The algorithm takes as input
the process ID, the checkpoint timestamp, and the window duration. It first
initialises an empty list for the output traces. Then, it retrieves all execution traces
associated with the process ID using the getTraces() method, which extracts all
execution traces as illustrated in Figure 3. For each instance, it checks whether
its endTime property is None (instance still running), or less than or equal to
the start of the window. If so, it appends the execution trace to the output trace
list. Finally, the algorithm returns as output the traces executed in that
window. The time complexity of this algorithm is O(n), where n is the number
of instances of the process.


Algorithm 1 Get traces in the sliding window


Inputs: Process ID PID, checkpoint timestamp ts, window duration td
Output: Execution traces T

1: T := []
2: T_all := PID.getTraces()
3: for each Tr ∈ T_all do
4:     if Tr.endTime is None or Tr.endTime ≤ ts − td then T.append(Tr)
   return T


3.2 Transforming BPMN into LTS 


An LTS is a semantic model that shows all possible execution paths for a process.
To transform BPMN into LTS, we rely on an existing approach that first
translates BPMN into the LNT process algebraic specification language, and then
transforms it into an LTS by using CADP compilers [7]. For more information
on the transformation process from BPMN to LTS, please refer to [22, 27].


3.3 Transforming LTS into PTS 


The transformation process from an LTS to a PTS consists of two steps. The
first step traverses all provided instances and identifies all the possible
execution paths for each instance (Algorithm 2). In the second step, a counter
is added to each transition of the LTS, thus allowing us to track the number of
times each transition is executed. This facilitates the calculation of the probability
value associated with executing each transition. Finally, the output model
is represented as a PTS (Algorithm 3).

An execution path is a sequence of transitions in the LTS that matches 
with the execution trace of an instance. When an instance has been successfully 
completed, there exists only one corresponding execution path. The LTS may 
exhibit non-deterministic behaviour due to the presence of inclusive gateways in 
the BPMN model. Therefore, when considering unfinished instances, we calculate 
the execution probabilities of all relevant paths and normalize these probabilities. 

Algorithm 2 takes as input an LTS and an execution trace of an instance
$T_{tasks}$ (i.e., a list of tasks), and finds all feasible execution paths in the LTS that
match the given execution trace. The algorithm uses a depth-first search (DFS)
approach to traverse the LTS, starting from the initial state. It compares the
tasks on the transitions of the LTS with the tasks in the ordered sequence of tasks
to determine feasible paths. The algorithm maintains a stack to keep track of the
current state and partial paths, and recursively explores all possible transitions
from the current state until it reaches a state that fully matches the ordered
sequence of tasks. Since the model is non-deterministic, it then backtracks
to explore other possible transitions and continues the exploration process until
all paths have been exhaustively explored. The time complexity of the algorithm
is $O(|Q| \times |\Delta|)$, where $|Q|$ represents the number of states in the LTS and $|\Delta|$
represents the number of transitions in the LTS.


Algorithm 2 Get all execution paths of an instance in LTS (FINDPATHS)


Inputs: LTS = (Q, Σ, q_init, Δ), an execution trace T_tasks = [t1, t2, ..., tn]
Output: A list of paths (resultPaths)

 1: resultPaths := []
    return DFS(LTS, T_tasks, q_init, [], resultPaths)

 2: function DFS(LTS, tasks, q_current, currentPath, resultPaths)
 3:     if Size(tasks) == 0 then
 4:         return resultPaths.append(currentPath)
 5:     else
 6:         task := tasks[0]; restTasks := tasks[1:]
 7:         Q_next := {q ∈ Q | (q_current, task, q) ∈ Δ}
 8:         for all q_next ∈ Q_next do
 9:             nextPath := currentPath
10:             nextPath.append((q_current, task, q_next))
11:             DFS(LTS, restTasks, q_next, nextPath, resultPaths)
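
The depth-first search of Algorithm 2 can be sketched in Python as follows. This is an illustrative reading of the pseudocode rather than the authors' implementation: the function name `find_paths`, the representation of the LTS as a collection of `(source, label, target)` triples, and the successor index are assumptions of this example. Each completed path is copied before it is stored, so that backtracking does not alter paths that were already recorded.

```python
from collections import defaultdict

# A transition is a triple (source state, label, target state).
Transition = tuple[str, str, str]


def find_paths(transitions, q_init, tasks):
    """Return every path from q_init whose labels spell out `tasks` in order."""
    # Index the transition relation by (source, label) for direct lookup.
    successors = defaultdict(list)
    for src, label, dst in transitions:
        successors[(src, label)].append(dst)

    result = []

    def dfs(state, position, path):
        if position == len(tasks):
            result.append(list(path))  # store a copy of the finished path
            return
        task = tasks[position]
        for nxt in successors[(state, task)]:
            path.append((state, task, nxt))
            dfs(nxt, position + 1, path)
            path.pop()  # backtrack and try the next candidate transition

    dfs(q_init, 0, [])
    return result
```

On an LTS where state `s0` has two `A`-labelled transitions (to `s1` and to `s2`) and both targets offer a `B`-labelled transition, the trace `["A", "B"]` yields two paths, whereas a trace that cannot be replayed from the initial state yields an empty list.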


Algorithm 3 takes as input an LTS and a list of execution traces $\mathcal{T}$, and
computes a PTS representing the probability distribution of transitions between
states of the LTS based on the occurrence of tasks in the set of execution traces.
The algorithm first initialises a counter for each transition in the LTS, which
records the number of times the transition is taken in the execution traces (line 1).
Then, for each execution trace in the list, the algorithm computes the set of
possible execution paths in the LTS that correspond to the execution trace (line 5).
If there is only one path, the algorithm increments the counter for each transition
in the path by 1 (lines 6 to 7). If there are multiple paths, the algorithm
increments the counter for each transition in each path by 1, but also keeps track
of the number of execution traces that have multiple paths to avoid
double-counting (lines 10 to 11). Finally, the algorithm computes the probability of
each transition by dividing its counter by the sum of counters for all transitions
with the same source state and event (line 12). The resulting probabilities are
normalised so that they sum to 1 (line 13). The algorithm returns the PTS,
which consists of the set of states, tasks, and transitions of the LTS, along with
the computed probabilities for each transition. The time complexity of this
algorithm is $O(|\mathcal{T}| \times |Q| \times |\Delta|)$, where $|\mathcal{T}|$ is the number of execution traces, $|Q|$
represents the number of states in the LTS, and $|\Delta|$ represents the number of
transitions in the LTS.


Algorithm 3 Computation of PTS (COMPUTEPTS)


Inputs: LTS = (Q, Σ, q_init, Δ), a list of execution traces 𝒯 = [T_1, T_2, ..., T_n]
Output: PTS = (S, A, s_init, δ, P)

 1: for each (q, a, q') ∈ Δ do cnt((q, a, q')) := 0
 2: Paths := [], counter := 0        ▷ counter records the number of unfinished traces
 3: for all T_i ∈ 𝒯 do
 4:     T_tasks := T_i.getTasks()
 5:     Paths := FINDPATHS(LTS, T_tasks)        ▷ FINDPATHS (Algorithm 2)
 6:     if Size(Paths) == 1 then
 7:         for each (s, a, s') ∈ Paths[0] do cnt((s, a, s')) := cnt((s, a, s')) + 1
 8:     else
 9:         counter := counter + 1
10:         for each Path ∈ Paths do
11:             for each (s, a, s') ∈ Path do cnt((s, a, s')) := cnt((s, a, s')) + 1
12: P := {(s, a, s') ↦ cnt((s, a, s')) / Σ_{(s, a, s'') ∈ Δ} cnt((s, a, s''))}        ▷ calculate probabilities
13: P := Normalisation(P)
    return (S, A, s_init, δ, P)
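
The counting scheme of Algorithm 3 can be sketched as follows. The code is a simplified illustration under stated assumptions, not the authors' implementation: traces are plain lists of task names, `find_paths` is a compact re-implementation of Algorithm 2, and lines 12 and 13 are merged into a single step that divides each counter by the total count of all transitions leaving the same source state, so that the outgoing probabilities of every visited state sum to 1 as required by Definition 2. Transitions leaving a state that no trace visits receive probability 0 here, which is a choice of this sketch.

```python
from collections import defaultdict


def find_paths(transitions, q_init, tasks):
    """All paths from q_init whose labels spell out `tasks` (cf. Algorithm 2)."""
    successors = defaultdict(list)
    for src, label, dst in transitions:
        successors[(src, label)].append(dst)
    result = []

    def dfs(state, position, path):
        if position == len(tasks):
            result.append(list(path))
            return
        for nxt in successors[(state, tasks[position])]:
            path.append((state, tasks[position], nxt))
            dfs(nxt, position + 1, path)
            path.pop()

    dfs(q_init, 0, [])
    return result


def compute_pts(transitions, q_init, traces):
    """Return (prob, ambiguous): a probability per transition and the number
    of traces that did not map to exactly one path."""
    cnt = {t: 0 for t in transitions}
    ambiguous = 0
    for tasks in traces:
        paths = find_paths(transitions, q_init, tasks)
        if len(paths) != 1:
            ambiguous += 1
        for path in paths:
            for t in path:
                cnt[t] += 1

    # Total number of observed moves out of each source state.
    out_total = defaultdict(int)
    for (src, _label, _dst), c in cnt.items():
        out_total[src] += c

    prob = {}
    for t, c in cnt.items():
        total = out_total[t[0]]
        prob[t] = c / total if total else 0.0
    return prob, ambiguous
```

With the four traces `["A", "C"]`, `["A"]`, `["B", "C"]`, `["A", "C"]` over an LTS in which `s0` offers `A` (to `s1`) and `B` (to `s2`), the transition labelled `A` is taken three times out of four and therefore receives probability 0.75, while the one labelled `B` receives 0.25.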


3.4 Critical Tasks 


In this subsection, we describe how to define and compute critical actions/tasks 
given an LTS model of a BPMN process and a probabilistic property. Critical 
tasks refer to specific tasks that play a crucial role in determining whether a 
system’s behaviour violates or satisfies a given property. This notion is at the 
heart of the enforcement techniques presented in the next subsection. 

The notion of critical task used here is inspired by the notion of last action 
of the property introduced in [16]. This paper states that the violation of a 
property by a given model is somehow triggered when the last action of the 
property is executed by the model. In other words, if the last action is not 
executed, the model does not violate the property. Depending on the actions 
used in the probabilistic property (including the last action), we can identify 
one or more execution paths in the LTS, including the actions of the property, 
where each path consists of an ordered list of transitions. We then traverse this 
set of paths and for each path we search for the last state (the closest to the end 
of the path) corresponding to a choice between several transitions. This state
s is particularly important because it is the last opportunity to avoid reaching
the last action (of the property) and thus violating the property. The actions or
tasks for all transitions outgoing from state s are candidate critical tasks. At
this point, the operator of the property needs to be considered. If the operator is
less than ("<" or "≤"), there is one critical task, corresponding to the transition
outgoing from s and leading to the last action. If the operator is greater than
(">" or "≥"), the critical tasks correspond to all transitions outgoing from s
and leading to actions other than the last one. If the operator is "=" or "<>",
the critical tasks correspond to all tasks appearing on transitions outgoing from
s.


Algorithm 4 Computation of critical tasks in LTS (COMPUTECRITICALTASKS)

Inputs: LTS = (Q, Σ, q_init, Δ), probabilistic property (pp)
Output: a set of critical tasks (CTasks)

 1: CTasks := {}, T_tasks := pp.getTasks()
 2: Paths := FINDPATHS(LTS, T_tasks)      ▷ FINDPATHS (Algorithm 2)
 3: for each path ∈ Paths do
 4:     reversedPath := REVERSE(path)
 5:     for each transition (s, task, s′) in reversedPath do
 6:         Δ_s := {(s, a, q) ∈ Δ | q ∈ Q}
 7:         if Size(Δ_s) > 1 then
 8:             if pp.operator() is ">" or "≥" then
 9:                 CTasks := CTasks ∪ {a ∈ Σ \ {task} | ∃q ∈ Q, (s, a, q) ∈ Δ_s}
10:             else if pp.operator() is "<" or "≤" then
11:                 CTasks := CTasks ∪ {task}
12:             else
13:                 CTasks := CTasks ∪ {a ∈ Σ | ∃q ∈ Q, (s, a, q) ∈ Δ_s}
14:             break
    return CTasks


Algorithm 4 presents a method for computing the critical tasks (CTasks)
given an LTS and a probabilistic property (pp). The algorithm starts by
initialising CTasks as an empty set and extracts the set of all tasks T_tasks included
in the probabilistic property. Next, it calls FINDPATHS (Algorithm 2) to find all
paths in the LTS that include the tasks in T_tasks (line 2). For each path found,
the algorithm reverses it and iterates over the transitions in reverse order. For
each transition t represented as (s, task, s′), the algorithm selects the set of
outgoing transitions from state s in the LTS, denoted by Δ_s (line 6). If the size of
Δ_s is greater than 1, the algorithm checks the operator specified in pp (lines 7
to 13). If the operator is either > or ≥, the algorithm adds to CTasks the set
of all actions a in Σ that label outgoing transitions from state s and do not
correspond to the task in task (lines 8 to 9). If the operator is < or ≤, the
algorithm adds the task task to CTasks (lines 10 to 11). Otherwise, the algorithm
adds to CTasks the set of all actions a in Σ that label outgoing transitions from
state s (line 13). Finally, the algorithm breaks out from the loop for the current
path. The algorithm returns the set of critical tasks CTasks as output. The time
complexity of this algorithm is O(f(n) × |Δ|), where f(n) is the time complexity
of the FINDPATHS algorithm and |Δ| is the number of transitions in the LTS.
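The case analysis on the property operator can be made concrete with a small Python sketch of Algorithm 4. It is only an illustration: the paths are assumed to be given (standing in for the result of FINDPATHS), the operator is passed as a plain string, and the function name and the toy LTS are hypothetical.

```python
def compute_critical_tasks(transitions, paths, operator):
    """Sketch of the critical-task computation.

    transitions: set of (source, action, target) triples of the LTS.
    paths: paths through the LTS containing the property's tasks
           (assumed to be produced by something like FINDPATHS).
    operator: comparison operator of the property, e.g. "<" or ">=".
    """
    critical = set()
    for path in paths:
        for (state, task, _target) in reversed(path):
            # Transitions leaving the source state of the current transition.
            outgoing = [t for t in transitions if t[0] == state]
            if len(outgoing) > 1:                     # state is a choice point
                actions = {action for (_s, action, _q) in outgoing}
                if operator in (">", ">="):
                    critical |= actions - {task}      # every alternative to task
                elif operator in ("<", "<="):
                    critical.add(task)                # the task leading to the last action
                else:                                 # "=" or "<>"
                    critical |= actions
                break                                 # last choice on this path handled
    return critical

# Toy LTS: T1, then a choice between T2 and T3; after T2, a choice between T4 and T5.
toy_lts = {
    ("s0", "T1", "s1"), ("s1", "T2", "s2"), ("s1", "T3", "s5"),
    ("s2", "T4", "s3"), ("s2", "T5", "s4"),
}
toy_path = [("s0", "T1", "s1"), ("s1", "T2", "s2"), ("s2", "T4", "s3")]
print(compute_critical_tasks(toy_lts, [toy_path], "<"))   # {'T4'}
```

For a property of the form "T4 after T2 with probability less than a bound", the last choice before T4 is state s2, so T4 is the only critical task; with a "greater than" operator the sketch returns the alternative T5 instead.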


3.5 Probabilistic Runtime Enforcement (PRE) 


The enforcement mechanism (EM) requires as input a probabilistic property φ
and an LTS (Fig. 4). It is triggered right after the monitoring component. At
runtime, it periodically receives a list of execution traces and a list of waiting
tasks (waiting to be executed) from the monitoring component, and produces
as output a list of tasks (to be executed) whose execution does not cause the
violation of the probabilistic property, as verified using PMC techniques.


[Figure: the Enforcement Mechanism (with its Buffer) takes the probabilistic
property φ, the execution traces, and the list of waiting tasks as inputs, and
outputs the list of tasks to be executed.]


Fig. 4: Overview of PRE. 


The enforcement techniques used in this paper rely on two operations:
reordering and buffering. Reordering techniques correspond to a change in the
order of application of some of the tasks received as input. Buffering techniques
rely on a FIFO buffer B, which stores critical tasks when necessary. Buffering
techniques aim at delaying the execution of specific tasks by adding them
temporarily to the buffer B and taking them out of the buffer when their execution
does not induce the violation of the property.

Algorithm 5 presents the enforcement mechanism in detail. The algorithm
takes as input a list of (waiting) tasks, a probabilistic property φ, and an LTS.
It returns a list of tasks to be executed (in the best case, the same sequence of
tasks given as input) that satisfies φ. The idea is to update the PTS by merging
the execution traces and the tasks to be executed (waiting tasks and tasks in
the buffer), and to use PMC techniques to determine whether these new tasks
would still preserve the satisfaction of the property. If the execution of these
tasks would violate the property, buffering or reordering techniques are triggered.


Algorithm 5 Enforcement Mechanism

Inputs: a list of execution traces 𝒯, a list of waiting tasks σ_T, a probabilistic
property φ, an LTS.
Output: a list of tasks to be executed σ′_T

 1: if EM is not initialised then      ▷ ct and B are global variables
 2:     ct := COMPUTECRITICALTASKS(LTS, φ)      ▷ Algorithm 4
 3:     B := [], σ := σ_T      ▷ Initialise buffer B
 4: else
 5:     σ_buffer := ⟨task | task ∈ B.getTasks()⟩      ▷ All tasks in buffer
 6:     σ := Concat(σ_buffer, σ_T)      ▷ Concatenation
    return σ′_T := EM(LTS, 𝒯, σ, φ, ct)

 7: function EM(LTS, 𝒯, σ, φ, ct)
 8:     if CHECK(LTS, 𝒯, σ, φ) then
 9:         σ_s := ⟨task | task ∈ σ ∧ task ∈ B.getTasks()⟩
10:         RemovefromBuffer(σ_s)      ▷ Buffering: (Remove)
11:         return σ
12:     else
13:         σ_1 := ⟨task | task ∈ σ ∧ task ∈ ct⟩, σ_2 := ⟨task | task ∈ σ ∧ task ∉ σ_1⟩
14:         σ_r := Reorder(σ_1, σ_2)      ▷ Reordering
15:         if CHECK(LTS, 𝒯, σ_r, φ) then
16:             σ_s := ⟨task | task ∈ σ_r ∧ task ∈ B.getTasks()⟩
17:             RemovefromBuffer(σ_s)      ▷ Buffering: (Remove)
18:             return σ_r
19:         else
20:             σ′, σ″ := BISECTION(σ_1)      ▷ Binary search
21:             σ_a := ⟨task | task ∈ σ″ ∧ task ∉ B.getTasks()⟩
22:             AddtoBuffer(σ_a)      ▷ Buffering: (Add)
23:             σ_b := Concat(σ_2, σ′)      ▷ Concatenation
24:             EM(LTS, 𝒯, σ_b, φ, ct)
25: function CHECK(LTS, 𝒯, σ, φ)      ▷ Probabilistic model checking
26:     return UPDATEPTS(LTS, 𝒯, σ) ⊨ φ ? true : false
27: function UPDATEPTS(LTS, 𝒯, σ)      ▷ Transforming LTS into PTS
28:     ℐ := []
29:     for each task ∈ σ, in order do I := task.getInstance()      ▷ I: execution trace
30:         I.append(task), ℐ.append(I)
31:     for each τ ∈ 𝒯 do I := τ.getInstance()
32:         if I ∉ ℐ then ℐ.append(I)
33:     return COMPUTEPTS(LTS, ℐ)      ▷ COMPUTEPTS (Algorithm 3)
34: function BISECTION(σ)      ▷ Binary search
35:     n := Size(σ); m := ⌊n/2⌋
36:     return σ[0...m], σ[m...n]


The algorithm is initialised when the EM is called for the first time.
Initialisation consists of (i) computing the critical tasks using the
COMPUTECRITICALTASKS algorithm (Algorithm 4) and storing them in the global variable ct, and
(ii) initialising the buffer B to empty. The COMPUTECRITICALTASKS algorithm
computes the tasks of the process that can avoid the property violation and thus
will be stored in the buffer B by the enforcer when necessary. When the
enforcement mechanism is used for the first time, the list of tasks to be processed only
consists of the waiting tasks. Later on, each time enforcement is used, the list
of tasks to be processed is obtained by concatenating all the tasks in the buffer
with the tasks in the waiting list (line 6). Function EM then starts processing
this list of tasks. The CHECK function first verifies whether the given execution
traces and the given list of tasks satisfy the property by using PMC. If this
function returns true, all the tasks are removed from the buffer and the algorithm
returns the tasks in the buffer and the waiting tasks (lines 8 to 11). Otherwise,
the enforcement techniques are triggered. First, reordering techniques are
applied as follows. The list of tasks is reordered by favouring (and thus executing
first) the non-critical tasks, which are placed at the beginning of the list. Then,
the PTS is built again, and PMC is called to check whether ordering the tasks
to be executed differently avoids the property violation (line 15). If the result is
true, the buffer is emptied, and the list of tasks is returned. If the result is false,
reordering techniques are not enough, and in such a case, the mechanism
executes only some of the tasks. To identify the subset of tasks that
can be executed without violating the property, we use the BISECTION
function (lines 34 to 36). This function helps to avoid an exhaustive exploration of
all possible combinations of tasks (and calling PMC for each solution), which
would be too costly and time-consuming. This function divides the list of critical
tasks into two parts. The algorithm then puts the second part into the buffer
and recursively calls the EM function for this new list of tasks, which is the
list of non-critical tasks (computed on line 13) concatenated with the first part
returned by the BISECTION function (lines 20 to 24). The algorithm ends when
the verdict of PMC is true and returns a list of safe-to-execute tasks.

The time complexity of this algorithm is O(log |σ_T| × f(|σ_T|)), where |σ_T| is
the size of the given list of tasks, and f(|σ_T|) represents the time complexity of
using PMC.
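The interplay of checking, reordering, bisection, and buffering can be sketched in Python as follows. This is a simplified illustration rather than the authors' Java implementation: the recursion is written as a loop, the PMC verdict is replaced by a caller-supplied function check (here a toy rule that is purely hypothetical), tasks are represented as (task name, instance id) pairs, and the sketch assumes that non-critical tasks may always run once no critical task is left to delay.

```python
def enforce(waiting, buffer, critical, check):
    """Return the tasks allowed to run now; `buffer` (a list) is updated in place.

    waiting:  list of (task_name, instance_id) pairs waiting to be executed.
    critical: set of critical task names.
    check:    callable(list_of_tasks) -> bool, a stand-in for the PMC verdict.
    """
    sigma = list(buffer) + list(waiting)              # buffered tasks come first
    while not check(sigma):
        crit = [t for t in sigma if t[0] in critical]
        other = [t for t in sigma if t[0] not in critical]
        sigma = other + crit                          # reordering: non-critical first
        if check(sigma):
            break
        if not crit:                                  # assumption: nothing left to delay
            break
        half = len(crit) // 2                         # bisection of the critical tasks
        delayed = [t for t in crit[half:] if t not in buffer]
        buffer.extend(delayed)                        # buffering: delay the second half
        sigma = other + crit[:half]                   # retry with the first half only
    buffer[:] = [t for t in buffer if t not in sigma] # released tasks leave the buffer
    return sigma

# Toy stand-in for PMC: at most one T4 task may be executed per round.
def at_most_one_t4(tasks):
    return sum(1 for name, _instance in tasks if name == "T4") <= 1

buf = []
first = enforce([("T4", 1), ("T5", 1), ("T4", 2), ("T2", 3)], buf, {"T4"}, at_most_one_t4)
print(first, buf)
second = enforce([("T5", 4)], buf, {"T4"}, at_most_one_t4)
print(second, buf)
```

In the first round the second T4 task is moved to the buffer; in the second round it is released together with the new waiting task, since the toy check then succeeds immediately.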


3.6 Characteristics 


This paper proposes an enforcement mechanism that is online, untimed, and
operational, meaning it utilises real-time system traces, disregards physical time
intervals, and offers a practical implementation guide. This mechanism has three main
characteristics: soundness, monotonicity, and transparency. PRE refers to the
probabilistic enforcement mechanism, PRE.buff is the buffer B, ¬E(PRE.buff)
means that the buffer was not triggered, PRE.out refers to the output of the
mechanism, and CHECK refers to the probabilistic model checking function.

Proposition 1 states that the tasks in each trace generated by the mechanism
do not violate the properties of the system by their execution.


Proposition 1 (Soundness)
∀σ : PRE(LTS, 𝒯, σ, φ).out = σ′_T ⟹ CHECK(LTS, 𝒯, σ′_T, φ) == true


Proof (Sketch). If the PMC’s verdict is false, the execution monitor does not
produce any tasks as output to maintain soundness.

Proposition 2 states that the enforcer’s output sequence consistently grows
with respect to the number of non-critical tasks in the input sequence.

Proposition 2 (Monotonicity)
∀t ∈ σ, t′ ∈ σ′, t, t′ ∉ ct : size(σ) < size(σ′) ⟹ size(PRE(LTS, 𝒯, σ, φ).out) <
size(PRE(LTS, 𝒯, σ′, φ).out)


Proof (Sketch). The buffer exclusively stores critical tasks. Therefore, as the 
number of non-critical tasks in the input increases, the length of the output of 
the mechanism also increases. 

The execution monitor is transparent, which means that it only intervenes if
the input tasks to be executed violate the property.


Proposition 3 (Transparency)
PRE(LTS, 𝒯, σ, φ).out = σ′_T, ¬E(PRE.buff) ⟹ PRE(LTS, 𝒯, σ′_T, φ).out = σ


Proof (Sketch). Since there is no suppression operation in the enforcement
mechanism, all tasks in the input σ are the same as in the output σ′_T when the buffer
is not triggered.


4 Tool Support & Evaluation 


This section first presents the toolchain that automates the different steps of 
our approach. We then provide a practical illustration of the approach and tools 
using a case study. Finally, additional experiments are presented to evaluate the 
tools’ performance on a series of realistic examples. 


4.1 Tool 


Figure 5 gives an overview of the toolchain. As far as the inputs are concerned, we
rely on the open-source tool Activiti [2] to specify and execute BPMN processes.
Probabilistic properties are described using MCL. The monitoring techniques
are implemented in Java and aim at extracting the required information about
execution traces from a MySQL database. The transformation from BPMN
processes to LTS models is performed using an open-source tool called VBPMN [21].
The annotation of the LTS model with probabilities, thus resulting in a PTS 
model, is implemented in Java. PMC is computed using the CADP probabilistic 
model checker, which takes as input an MCL probabilistic property and a PTS, 
and returns a Boolean value. Finally, the enforcer is also implemented in Java 
and applies the correction when necessary on the input flow of tasks using the 
techniques (reordering and buffering) presented in Section 


4.2 Case Study 


Fig. 5: Toolchain overview.


The approach is illustrated using the shipment process of a hardware retailer [25].
Figure 6 shows the BPMN process of this example, whose final goal is to
deliver goods. More precisely, this process starts when there are goods ready for
shipment. Two tasks are then executed concurrently: one involves packaging the
goods (T7) while the other determines whether a normal or special shipment
is required (T1). Based on that decision, the first option verifies the need for
additional insurance (T2), followed by the opportunity to purchase additional
insurance (T4) and/or complete a post-label (T5). Another option is to request
quotes from carriers (T3), followed by assigning a carrier and preparing the
paperwork (T6). Finally, the package is transferred to a designated pick-up area.

Fig. 6: BPMN shipment process of a hardware retailer.

For illustration purposes, we choose a property checking that the probability
of executing task T4 after task T2 is less than 0.5. This is important because
the choice of taking extra insurance (T4) comes with a cost, and if this decision
is taken too often (more than half of the time here), this could result in high
expenses over a short period of time. This property is expressed in MCL as follows:
prob true* . T2 . true* . T4 is <? 0.5 end prob. As the question mark symbol is used,
the model checker returns a Boolean value indicating the property’s truthfulness
and a numerical value representing the probability of executing T4 after T2.

Fig. 7: Experiments on the case study without enforcement.


We have conducted two series of experiments with this running example,
one without the enforcement mechanism (results are shown in Figure 7) and
the other with enforcement (Figure 8). The same randomized workload of 2000
instances was used for each experiment. These experiments show that, without
enforcement techniques, there is a 7% risk of violating the property, resulting in a
satisfaction rate of 93%. In other words, the property is violated 7% of the time,
which corresponds to the situations where the curve goes above the probability
threshold represented as a horizontal line in Figure 7. On the other hand,
Figure 8 shows that with enforcement, the instance executions keep satisfying
the given probabilistic property, resulting in a 100% satisfaction rate and no
violation of the property. In practice, this allows one to delay payment of extra
insurance over time and thus avoids peaks of extra expenses.

Fig. 8: Experiments on the case study with enforcement.


4.3 Experiments


The goal of this section is to evaluate the correctness and performance of the
enforcement approach. Correctness is measured as the satisfaction rate of the
probabilistic property during the running process, while performance
is measured by the average execution time (AET) of an instance. AET is
computed by summing the execution time of each instance and by dividing this value
by the number of instances. To conduct these experiments, we relied on a set
of BPMN processes taken from the literature. Each process was executed 1000
times, resulting in 1000 instances. The time taken between the startup of two
new process instances was computed using an exponential distribution with a
lambda value of 5 (λ = 5). These experiments were performed on an Ubuntu OS
laptop with a 1.7 GHz Intel Core i5 processor and 8 GB of RAM.
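The workload generation and the AET measure described above can be sketched in Python. This is an illustration only: the function names are hypothetical, the fixed seed is an arbitrary choice for reproducibility, and the time unit of the rate λ = 5 is not stated in the text, so the start times below are expressed in an unspecified unit.

```python
import random

def start_times(n_instances, rate, seed=0):
    """Start-up times of process instances with exponential inter-arrival gaps."""
    rng = random.Random(seed)             # fixed seed: an assumption of this sketch
    current, times = 0.0, []
    for _ in range(n_instances):
        current += rng.expovariate(rate)  # gap between two consecutive start-ups
        times.append(current)
    return times

def average_execution_time(durations):
    """AET: sum of the instance execution times divided by the number of instances."""
    return sum(durations) / len(durations)

times = start_times(1000, 5.0)
mean_gap = times[-1] / len(times)         # close to 1 / rate = 0.2 for 1000 samples
print(len(times), round(mean_gap, 2))
print(average_execution_time([0.5, 1.0, 1.5]))   # 1.0
```

With a rate of 5, the expected gap between two start-ups is 1/5 of a time unit, so 1000 instances are launched over roughly 200 time units.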


The results of these experiments are presented in Table 1. Each row gives
the results for a given process by providing a description, its size in terms of
number of tasks and gateways, the size of the corresponding LTS in terms of
number of states and transitions, the correctness results without (a) and with
(b) enforcement, and the AET without/with enforcement. The correctness value
corresponds to the satisfaction rate as a percentage (%). AET is given in
seconds.


Table 1: Experimental results for some case studies.

No.  BPMN Process           Tasks  Gateways    PTS     PTS          Correctness  AET (s)
                                   (per type)  States  Transitions
1    Shipment               8      2, 2, 2     18      38           (a) 93%      0.65
                                                                    (b) 100%     1.38
2    Shipment [25]          8      2, 2, 2     18      38           (a) 47%      0.68
                                                                    (b) 100%     [illegible]
3    Shopping [22]          22     8, 2, 2     59      127          (a) 93%      0.94
                                                                    (b) 100%     [illegible]
4    Shopping [22]          22     8, 2, 2     59      127          (a) 54%      0.97
                                                                    (b) 100%     [illegible]
5    AccountOpening [22]    15     3, 2, 2     20      33           (a) 89%      0.56
                                                                    (b) 100%     [illegible]
6    Online-Shop [22]       19     7, 2        36      74           (a) 96%      1.98
                                                                    (b) 100%     4.82
7    Multi-Inclusives [22]  8      6           141     1201         (a) 85%      3.42
                                                                    (b) 100%     11.44
8    Booking                11     210 [sic]   53      252          (a) 88%      2.42
                                                                    (b) 100%     6.17
Table 1 first shows that without enforcement techniques, the resulting
correctness results present a satisfaction rate below 100%, whereas this rate is
systematically 100% when enforcement is used. As for AET, the execution time
is longer when using enforcement techniques. The time increases when the
percentage of satisfaction of the property decreases. For instance, examples 1 and 2
use the same process but different properties. The percentage of property
violations of example 1 is lower than that of example 2; therefore, the latter takes more time
when using enforcement because it takes more time for the process instances to
complete. Similar results can be observed for examples 3 and 4. Although the
enforcement mechanism increases the execution time of the process, it
systematically ensures that the process executes while preserving the given property.


5 Related Work 


In this section, we first compare with existing works on probabilistic verification 
of business processes, and then we focus on enforcement techniques. 

The approaches proposed in deal with Bayesian networks to infer the
relationship between different events. As an example, the authors in [6] introduce a
BPMN normal form based on Activity Theory that can be used for representing 
the dynamics of a collective human activity from the perspective of a subject. 
This workflow is then transformed into a Causal Bayesian Network that can be 
used for modelling human behaviours and assessing human decisions. In (isiji9], 
the authors present a framework for modelling and analysing business workflows. 
These workflows are described with a subset of BPMN extended with probabilis- 
tic nondeterministic branching and general-purpose reward annotations. An al- 
gorithm translates such models into Markov Decision Processes (MDP) written 
in the syntax of the PRISM model checker. This enables quantitative analysis 
of business processes for properties such as transient /steady-state probabilities, 
reward-based properties, and best- and worst-case scenarios. These properties 
are verified using the PRISM model checker. This work supports design time 
analysis but does not focus on the dynamic execution and runtime verification 
of processes. The approach in extends BPMN with time and probabilities. 
Specifically, the authors expect that a probability value is provided for each flow 
involved in an inclusive or exclusive split gateway. These BPMN processes are 
then transformed to rewriting logic and analysed using the Maude statistical 
model checker PVeStA. The authors in propose to compute probabilities 
from execution traces of executable BPMN and apply probabilistic model check- 
ing techniques at runtime to analyse a given property. In this work, we also rely 
on PMC, but we go beyond the analysis of BPMN processes, because when the 
property is not satisfied, we apply techniques for enforcing the satisfaction of 
the property. 

As far as runtime enforcement is concerned, existing approaches usually rely
on common techniques including buffering, reordering, healing and discarding
actions or events. Buffering relies on storing events that violate a
certain property in a buffer, which helps delay their execution. Reordering was
used in several works for favouring or delaying the execution of some actions.
Healing is a technique that enforces properties by repairing or inserting new
events to ensure compliance. Suppression of events ensures property
enforcement by discarding specific events. In the context of BPMN processes, removing
specific tasks or artificially adding other tasks is meaningless due to the overall
goal of the running processes, explaining why we made use of reordering and
buffering techniques only. The authors of focus on developing runtime
enforcement techniques for timed properties, without targeting any specific
application area. In [7], the authors study runtime monitoring and enforcement of
first-order LTL properties over data evolution using an automata-based
technique. Their approach is based on the construction of a first-order automaton
that is able to perform the monitoring incrementally and by using exponential
space in the size of the property. This theoretical work does not focus on BPMN
probabilistic processes, nor on probabilistic properties.


6 Conclusion 


In this paper, we have proposed a probabilistic execution enforcement mechanism 
for BPMN processes at runtime. The BPMN process is first transformed into an 
LTS model. This model is periodically annotated with the execution probability 
of each transition in the LTS, resulting in a PTS model. This step is achieved 
by supervising the multiple executions of the BPMN process and extracting the 
corresponding execution traces. When new instances are triggered, new tasks
are waiting to be executed. We check whether the execution of these tasks would
violate the given probabilistic property. If it is the case, the enforcement
techniques are activated by either buffering or reordering tasks in order to avoid
the violation of the property. All the steps of the approach are automated by a
toolchain consisting of tools we implemented or reused. Experiments show the 
correctness of the approach, which preserves the truthfulness of the property, and 
a slight overhead in terms of performance, which comes from the time needed to 
apply enforcement techniques. 

The two main perspectives of this work are as follows. The first one is to 
extend the PRE mechanism in order to minimise the frequency of verifications 
by considering the PMC results. The second future work focuses on applying 
PMC results to dynamically adjust the resource allocation necessary for efficient 
process execution.

Abstract. The correct operation of safety-critical cyber-physical systems
is crucial. However, such systems often feature a large variability
of start configurations, an intractably large state space, a high degree 
of uncertainty, or inherently unsafe behavior. A model of the expected 
system behavior starting in the current state can be used by look-ahead 
controllers to derive control decisions to avoid paths to safety violations 
when possible. However, the computational effort for deriving and ana- 
lyzing the future system behavior is exponential in the look-ahead. 

In this paper, we employ Graph Transformation Systems (GTSs) for the 
modeling of expected system behavior. We then combine design-time and 
run-time control synthesis based on Supervisory Control Theory (SCT) 
achieving an exponential cost-reduction for a given controller look-ahead. 
For a fixed required reaction time of controllers, much longer look-aheads 
may therefore be employed. To illustrate and evaluate our approach, we 
consider a system where shuttles must avoid collisions with ambulances 
at level crossings. 


Keywords: cyber-physical systems, self-adaptive systems, supervisory control, 
model-predictive control, runtime verification, bounded model checking 


1 Introduction 


Cyber-physical systems in which software components operate in a physical en- 
vironment often encompass complex concurrent behavior. The development or 
synthesis of such control software achieving a given set of goals while also ensur- 
ing the satisfaction of a given safety-specification is crucial. In model-predictive 
control, a model of the expected system behavior is employed to obtain look- 
ahead controllers. Such controllers derive control decisions based on the set of all 
behavior sequences of a chosen look-ahead length starting in the current state. 
However, the set of such behavior sequences is exponential in the look-ahead 
length limiting the look-ahead to values allowing admissible reaction times. 

As a running example, we consider a variation of the RailCab system from
[80]. In this system, shuttles navigate on a large-scale track topology, which
intersects with a road topology at level crossings. Ambulances, which can be
monitored by shuttles with a certain degree of uncertainty, navigate on the road
topology and may traverse level crossings. The shuttle control to be derived
must avoid collisions with ambulances when possible by adjusting the speed of
the shuttle, taking potential ambulance behavior into account. To focus on our
approach and to simplify our presentation, we reduce the possible number of
steps of actors in the system model by employing a small topology fragment
with one level crossing, a single shuttle, and one ambulance.


Besides run-time efficiency, controller synthesis approaches for cyber-physical 
systems must solve an array of further problems. P1 (Sets of Start States): The 
start state of the system is often not precisely known requiring the consideration 
of a large or even infinite set of start states. These start states may differ in 
rigid components but also in the number, the state, and the interconnection 
of active components. For our running example, the underlying rigid topology 
and the location of shuttles and ambulances on this topology may vary greatly. 
P2 (State space explosion): Even when selecting a single start state, the state 
space of the system is often intractably large or even infinite because all steps of 
all components must be captured in the system model. P3 (Uncertainty): The 
uncontrolled part of the system can often not be modeled faithfully at design 
time due to uncertainty. For example, uncertainty arises due to behavioral or 
configuration adaptation as well as from unknown, unreliable, or unpredictable 
components/actors (such as humans) performing additional steps that cannot be 
foreseen at design time or fail to perform such steps [45]. P4 (Unsafe Systems): 
Avoidance of unsafe states is not always feasible due to uncertainty or in contexts 
where unsafe states cannot be avoided by control at all. 


For the modeling of the expected future system behavior, we employ Graph 
Transformation Systems (GTSs), which can be used when system states can 
be captured by graphs and when the steps of the involved components can 
be captured using local graph modifications. In the past, various GTS-variants
have been developed and employed for the design and analysis of such
systems in an abundance of publications such as [19, 22], focusing on different
system aspects and requirements.


To accommodate for these problems (discussed in more detail in the sub- 
sequent section), we propose a model-driven approach based on GTSs and the 
MAPE-K control framework where we employ a sliding window technique consid- 
ering actor-specific state fragments to reduce the computational effort (problems 
P1 and P2) and combine design-time control synthesis with run-time control syn- 
thesis as a look-ahead extension technique to efficiently obtain best-effort control 
(to tackle problems P3 and P4). At both design-time and run-time, we employ
an extension of Supervisory Control Theory (SCT) with priorities for the syn- 
thesis of controllers where the uncontrolled system is modeled using an extension 
of GTSs with controllability notions. 


This paper is structured as follows. In we discuss our conceptual
approach in the context of the MAPE-K framework including the sliding window
technique. In we consider related work. In we present
our extension of SCT with priorities. In we integrate controllability
notions into the GTS framework and present our running example. In
we discuss control synthesis at design-time. In Section 7, we discuss control
synthesis at run-time based on the design-time results. In Section 8, we evaluate
our approach for a larger case study. Finally, in Section 9, we conclude the paper
and provide an outlook on future work.


Fig. 1. Overview of MAPE-K-based approach


2 MAPE-K Closed-Loop Approach 


Software being executed in a cyber-physical system on a device often follows (at 
least implicitly) the MAPE-K closed-loop design depicted in 
developed for systems with a high degree of complexity, uncertainty, and dy- 
namicity. Such software interacts with its context in that system via sensors and 
effectors and keeps a Runtime Model (RTM) to store its local state across its 
looped executions. It executes (a) the monitoring phase to react to sensor infor- 
mation by updating the RTM accordingly, (b) the analysis phase to determine 
the impact of the most recent events on its options to achieve its control goals, (c) 
the planning phase to derive a control plan satisfying suitable quality standards, 
and (d) the execution phase to send events to the effectors to implement the 
steps of the derived control plan. Ideally, such a MAPE-K control architecture 
adapts to unexpected situations at run-time in an ad-hoc manner. 


In our approach, the RTM (see|Figure 1b) contains (a) a Bounded Forward 
State Space (BFSS) from the current system state s (derived and maintained 
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at run-time) and (b) a Bounded Backward State Space (BBSS) from unsafe 
states us (derived at design-time). Both of theses state spaces are (similarly 
to bounded model checking [50]) derived from the GTS capturing the expected 
system behavior. Moreover, the RTM contains the controllers derived from these 
two state spaces, which capture for each depicted state the exiting steps that the 
shuttle may perform. At run-time the controller obtained from the BFSS and 
the BBSS are combined by attempting to identify boundary graphs of the BBSS 
in the leaf states of the BFSS. For a BFSS and BBSS of depth n and k, this 
combination grants an effective look-ahead of n+ k to the controller. Clearly, the 
look-ahead should be maximized (taking other aspects such as required response 
time into account) to provide the controller synthesis procedure with as much 
information as possible to avoid the execution of overly conservative behavior
(such as unnecessarily slowing down the shuttle). Not employing a BBSS and only
constructing a BFSS of depth n + k to achieve the same look-ahead n + k would
be exponentially more expensive and, moreover, this additional cost would be
incurred at run-time, whereas at least the BBSS is obtained in our approach at
design-time, rendering its cost of construction negligible.
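To make the cost argument concrete, the following back-of-the-envelope sketch counts the states of a complete exploration tree. The uniform branching factor and the concrete values of b, n, and k are assumptions chosen only for illustration; real state spaces share states and branch irregularly.

```python
def bounded_state_count(branching: int, depth: int) -> int:
    """States in a complete tree of the given depth (the root is at depth 0)."""
    return sum(branching ** level for level in range(depth + 1))


# Hypothetical numbers, not taken from the paper.
b, n, k = 4, 3, 3

with_bbss = bounded_state_count(b, n)         # BFSS of depth n built at run-time
without_bbss = bounded_state_count(b, n + k)  # BFSS of depth n + k built at run-time

print(with_bbss, without_bbss)  # 85 5461
```

Under these assumptions, the run-time exploration grows by a factor of roughly b^k when the BBSS is not used.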

In our approach, the four MAPE phases are as follows. 

• Monitor phase: When the controller is informed via its sensors about a state
change from the BFSS root s to state s', it selects s' as the new root of the
BFSS. Unless the step to s' was unexpected due to uncertainty, s' is already
one of the successors of s contained in the BFSS.

• Analysis phase: States of the BFSS unreachable from s' are removed and the
GTS model is used to re-extend the BFSS to the chosen depth n. To identify
states to be avoided, all leaf states of the BFSS are checked for occurrences of
unsafe boundary states of the BBSS. Finally, the run-time controller is
adjusted to the modified BFSS by selecting steps to be prevented that would
lead to the states to be avoided.

• Planning phase: The controller can then plan the execution of any controllable
step exiting the new root state s' of the BFSS (in the running example, these
steps are the steps of the shuttle) or let the plant perform the next step.¹

• Execute phase: If a step has been selected in the planning phase, this step is
sent for execution to the corresponding effector (in the running example, a
hardware controller of the shuttle will receive and implement such a signal).
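The re-rooting and pruning performed in the monitor and analysis phases can be sketched as a plain reachability computation. This is a minimal illustration and not the authors' implementation: the function name `reroot`, the representation of steps as triples, and the example steps are hypothetical (the state names merely imitate the four-character encoding used for the running example).

```python
from typing import Dict, List, Set, Tuple

Step = Tuple[str, str, str]  # (source state, event, target state)


def reroot(steps: Set[Step], new_root: str) -> Tuple[Set[str], Set[Step]]:
    """Return the states reachable from new_root and the steps leaving them."""
    successors: Dict[str, List[str]] = {}
    for source, _event, target in steps:
        successors.setdefault(source, []).append(target)

    reachable = {new_root}
    stack = [new_root]
    while stack:
        state = stack.pop()
        for target in successors.get(state, []):
            if target not in reachable:
                reachable.add(target)
                stack.append(target)

    # Steps leaving unreachable states are dropped together with those states.
    kept = {step for step in steps if step[0] in reachable}
    return reachable, kept


# Hypothetical BFSS fragment rooted at X4fs.
bfss_steps = {
    ("X4fs", "ff", "X3fa"),
    ("X4fs", "fs", "X3sa"),
    ("X3fa", "acp", "X3fs"),
    ("X3sa", "acp", "X3ss"),
}
states, steps = reroot(bfss_steps, "X3fa")
```

After re-rooting at X3fa, only X3fa and X3fs remain; re-extending the pruned BFSS to depth n would then be a separate step.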

The worst-case controller response time depends on the time required for (a)
the full reconstruction of the BFSS and the corresponding controller synthesis
thereon (upon an occurrence of an unexpected step) and (b) the identification
of leaf states of the BFSS containing unsafe boundary states of the BBSS. The
usage of the BBSS exponentially reduces the computational effort for (a) as
discussed but, regarding (b), it also requires that the leaf states of the BFSS
be checked against a potentially large number of unsafe boundary states
instead of only the unsafe states. In our evaluation in Section 8, we measure and
further discuss these effects for a considered case study.

1 The absence of such controllable steps does not indicate a problem as the controller
may just not need to change the behavior of the agent (e.g., the shuttle may already
be driving at the desired speed) but, in the considered time-abstract setting, the
absence of any step implies that no control strategy guaranteeing the avoidance of
unsafe states could be obtained. In this case, fallback behavior such as unmodeled
emergency maneuvers or decisions by the environment on uncontrollable events may
still result in the avoidance of unsafe states.

As mentioned in the introduction already, we employ a sliding window
approach reducing the size of the BFSS and BBSS to be constructed. Instead of
assuming that each agent maintains a perspective on the entire system state, we
adopt an existing compositional technique in which agent-specific
scopes are used. On the one hand, this greatly reduces the number of steps (and
thereby the size of the BFSS and BBSS) as only a small number of agents will
typically be in the view range of an agent. On the other hand, a smaller view
range may result (closely related to the look-ahead) in overly conservative
controller behavior. Besides mitigating the effect of state space explosion, this
sliding window approach has the additional advantage that start states must
only be determined for each actor individually and not globally. Intuitively, each
system step must be followed by suitable postprocessing to update the reached
state to the view range of the actor. These postprocessing steps are part of the
system model and therefore define changes in the context of the agent to which
the controller must suitably respond. In our evaluation in Section 8, we further
discuss this sliding window technique as we abstract from it in our running
example to focus on controller synthesis via BFSS and BBSS.


3 Related Work 


Model checking is often inadequate for complex systems due to the state 
space explosion problem and uncertainty. Bounded Model Checking (BMC) 
has been devised to reduce analysis costs providing, however, weaker 
guarantees and no support for uncertainty. 

When formal fully-automatic verification is infeasible, Runtime Verification (RV),
also called Runtime Monitoring, is an approach for monitoring the system's
states and steps at run-time for notable behavior such as violations of invariants
that require a manual or automatic response. However, without look-ahead
capabilities, potential near-future unsafe states cannot be detected. Therefore,
some RV approaches integrate a behavioral model
describing expected future evolutions of the system. In [45], the expected future
evolutions of a Timed Automaton (TA) are analyzed at run-time using BMC. In
[15], Deterministic Timed Markov Chains modeling the system are analyzed at
design-time to obtain expressions over step probabilities that become available
only at run-time; probability-maximizing decisions are then made at run-time by
evaluating these expressions instead of performing computationally expensive
analysis. In [23], a run-time statistical model checking component has been
integrated into a self-adaptive system. However, these approaches also rely on BMC
and thereby suffer from state space explosion and, in some cases,
also from being unable to react to uncertain events.
The approach of k-induction that has been adopted for variants of
GTSs establishes state invariants by symbolically applying GT rules
backwards from unsafe states to accumulate context capturing why and how the
symbolic violation could be reached. This approach is thereby a symbolic version
of backward BMC. We use a similar approach in this paper, tackling the problem
of a large number of undesirable backward steps constructed by k-induction.

A combination of forward and backward BMC similar to our approach has been
used for the analysis of Hybrid Automata; it applies depth-first search forward and
backward in parallel to find paths to unsafe states for Hybrid Automata with
complex state space structure.

SCT, as established for capturing, analyzing, and synthesizing
supervisory control when the controllers, the plants, and their closed loops are
given by regular languages over events, has to our knowledge not been
combined with event priorities. However, priorities have been used to combine
supervised modules preventing blocking situations in [6], [7]. Also, approaches
in the Model Predictive Control domain employ models to predict the future
system behavior as in our approach but usually focus
on continuous-time systems minimizing costs and have not been
combined with SCT to the best of our knowledge. Besides the approach to dis- 
tinguish between controllable and uncontrollable events as customary in SCT, 
other approaches of identifying actions of different actors and capturing inter- 
actions among such actors in the GT domain include [o] but also SCT for TA 
(related to above) has been considered in [42]. where 
a safety constraint has already been violated due to uncertainty or adversarial 
effects requiring the derivation and execution of recovery mechanisms. 


4 Priority-aware Supervisory Control Theory 


We recall SCT as introduced in the seminal work of Ramadge and Wonham 
in which the closed loop is given by the event-synchronizing composition 
of controller and plant. To provide the essentials of this approach in our notation 
and to extend this approach with the concept of event priorities, we introduce a 
variant of Labeled Transition Systems (LTSs) extending finite automata thereby 
capturing regular languages over an event alphabet as considered in standard 
SCT. In such an LTS, events are grouped into controllable and uncontrollable 
events (cf. the MAPE-K closed-loop in Figure 1a), which are executed by the
controller (e.g., signals to effectors) and the plant (e.g., signals from sensors). The 
controller may restrict the execution of controllable events in the closed-loop. 
We aim at controller synthesis such that event-prevention ensures that the 
closed-loop avoids undesirable states (this notion is formalized below as non- 
blockingness) and no steps executing uncontrollable events have been prevented 
at the model level (this notion is formalized below as controllability) while 
not preventing event executions unnecessarily to retain the highest possible 
degree of freedom for further control steps.² We equip events with a priority
as motivated in the next section by our running example: steps executing
(un)controllable events are then only enabled when no steps executing higher- 
priority (un)controllable events are enabled (i.e., priorities are checked within 
the two groups of controllable and uncontrollable events separately). 


Definition 1 (Labeled Transition System (LTS)). A Labeled Transition
System (LTS) $\Gamma$ contains the following components.

• $\mathrm{states}(\Gamma)$ contains all states and its subsets $\mathrm{start}(\Gamma)$, $\mathrm{safe}(\Gamma)$, and $\mathrm{unsafe}(\Gamma)$
contain the start, safe, and unsafe states.

• $\mathrm{events}(\Gamma)$ contains the controllable and uncontrollable events $\mathrm{eventsC}(\Gamma)$ and
$\mathrm{eventsUC}(\Gamma)$.

• $\mathrm{prio}(\Gamma) : \mathrm{events}(\Gamma) \to \mathbb{N}$ assigns a priority to each event.

• $\mathrm{steps}(\Gamma) \subseteq \mathrm{states}(\Gamma) \times \mathrm{events}(\Gamma) \times \mathrm{states}(\Gamma)$ is a set of event-labelled steps.

Moreover, $\Gamma_1$ is a sub-LTS of $\Gamma_2$, written $\Gamma_1 \leq \Gamma_2$, when the components of
$\Gamma_1$ are contained in the corresponding components of $\Gamma_2$, and the reversed LTS
$\mathrm{rev}(\Gamma)$ is obtained by reversing $\mathrm{steps}(\Gamma)$ and swapping $\mathrm{start}(\Gamma)$ and $\mathrm{unsafe}(\Gamma)$.


The priority-resolved LTS is obtained by omitting all controllable/uncontrol- 
lable steps disabled by higher-priority controllable/uncontrollable steps. Only 
the paths through this priority-resolved LTS can actually be observed. 


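Definition 2 (Priority-resolved LTS). For an LTS $\Gamma$ and a set of events $E$,
$\Gamma' = \mathrm{resPrio}(\Gamma, E)$ is the largest sub-LTS of $\Gamma$ such that for all $(s, e_1, s_1) \in
\mathrm{steps}(\Gamma')$ with $e_1 \in E$ there is no $(s, e_2, s_2) \in \mathrm{steps}(\Gamma)$ with $e_2 \in E$ and
$\mathrm{prio}(\Gamma)(e_2) > \mathrm{prio}(\Gamma)(e_1)$. Then, the priority-resolved LTS of $\Gamma$ is given by
$\mathrm{resPrio}(\Gamma) = \mathrm{resPrio}(\mathrm{resPrio}(\Gamma, \mathrm{eventsUC}(\Gamma)), \mathrm{eventsC}(\Gamma))$.³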
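As an illustration of Definition 2, the following sketch resolves priorities on a set of steps. It is only a sketch under assumptions: an LTS is reduced to a set of (source, event, target) triples plus a priority dictionary, the function names are hypothetical, and the three-step example is made up.

```python
from typing import Dict, Set, Tuple

Step = Tuple[str, str, str]  # (source state, event, target state)


def res_prio(steps: Set[Step], prio: Dict[str, int], events: Set[str]) -> Set[Step]:
    """Drop each step with an event in `events` that competes, at the same
    source state, with a higher-priority step whose event is also in `events`."""
    best: Dict[str, int] = {}
    for source, event, _target in steps:
        if event in events:
            best[source] = max(best.get(source, prio[event]), prio[event])
    return {
        (s, e, t) for (s, e, t) in steps
        if e not in events or prio[e] == best[s]
    }


def res_prio_full(steps, prio, events_c, events_uc):
    """Resolve uncontrollable events first, then controllable ones."""
    return res_prio(res_prio(steps, prio, events_uc), prio, events_c)


# Hypothetical example: three steps leaving the same state.
steps = {("s0", "c2", "s4"), ("s0", "c1", "s1"), ("s0", "uc1", "s2")}
prio = {"c2": 2, "c1": 1, "uc1": 0}
events_c, events_uc = {"c1", "c2"}, {"uc1"}

separate = res_prio_full(steps, prio, events_c, events_uc)
joint = res_prio(steps, prio, events_c | events_uc)

print(sorted(separate))  # [('s0', 'c2', 's4'), ('s0', 'uc1', 's2')]
print(sorted(joint))     # [('s0', 'c2', 's4')]
```

The two results differ because the uncontrollable step only competes with other uncontrollable steps when the two groups are resolved separately, which is the point made in footnote 3.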


A controller $\Gamma_C$ to be synthesized for a given plant $\Gamma_P$ is a sub-LTS of $\Gamma_P$ and,
hence, the event-synchronizing closed loop of $\Gamma_C$ and $\Gamma_P$ is just $\Gamma_C$.

The notion of controllability requires that the controller cannot prevent un- 
controllable events that the plant can execute. 


Definition 3 (Controllability). A plant $\Gamma_P$ and a controller $\Gamma_C \leq \Gamma_P$ satisfy
controllability, if every path $\pi$ of $\mathrm{resPrio}(\Gamma_C)$ that can be extended by $\mathrm{resPrio}(\Gamma_P)$
with a step executing an uncontrollable event $u \in \mathrm{eventsUC}(\Gamma_P)$ can be extended
by $\mathrm{resPrio}(\Gamma_C)$ with a step executing $u$ as well.


The notion of non-blockingness requires the liveness property that the closed 
loop may eventually reach a safe state from any of its states. In our approach, 
we define unsafe states as those violating a state invariant and safe states as 
those not having paths to any unsafe states. 


Definition 4 (Non-blockingness). A plant $\Gamma_P$ and a controller $\Gamma_C \leq \Gamma_P$
satisfy non-blockingness, if every path $\pi$ of $\mathrm{resPrio}(\Gamma_C)$ can be extended to a
state in $\mathrm{safe}(\Gamma_P)$.


2 Note that controllers can only force certain events in a given state in this framework 
when all events executable from that state are controllable (differing from, e.g., ). 
3 Note that, in general, $\mathrm{resPrio}(\Gamma) \neq \mathrm{resPrio}(\Gamma, \mathrm{events}(\Gamma))$.
For the case of controllers and plants generating regular languages considered
here, admissible controllers satisfying controllability and non-blockingness are
closed under arbitrary unions. Desired controllers are therefore
defined as those admissible controllers that result in the largest closed loops in
terms of sets of executable event sequences. Admissible controllers are also closed
under arbitrary union in the presence of event priorities because the union of
controllers will result in a controller that favors the highest-priority steps from
any of the controllers and, moreover, LTSs are memoryless (beyond their current
state), implying that choosing higher-priority steps from different controllers
cannot lead to states not traversable using any of the controllers. However, only the
priority-resolved versions of synthesized controllers, for which the classic results
readily apply, are to be used anyway.

Following SCT, the first controller candidate is the plant LTS $\Gamma$. This candidate
is then incrementally refined by preventing events, enforcing controllability
and non-blockingness least-restrictively, until an admissible controller $\mathrm{control}(\Gamma)$
is obtained (closedness under arbitrary union also implies that the order in which
violations of controllability and non-blockingness are resolved is insignificant).
Note that this fixed-point procedure also supports cyclic LTSs in general (in
which, as usual, loops may delay the visiting of safe states indefinitely, as
opposed to [55]). To handle the case with priorities, we resolve priorities among
uncontrollable events before applying the fixed-point procedure and resolve
priorities of the remaining controllable steps afterwards to obtain the priority-aware
controller $\mathrm{pControl}(\Gamma)$.


Definition 5 (Priority-Aware Controller). An LTS $\Gamma$ induces the LTS $\Gamma' =
\mathrm{control}(\Gamma)$ by adapting $\Gamma$ as follows.⁴

• $\mathrm{steps}(\Gamma')$ is the largest subset of $\mathrm{steps}(\Gamma)$ such that for each $(s, e_1, s_1) \in
\mathrm{steps}(\Gamma')$ (non-blockingness) there is some path from $s_1$ to a state in $\mathrm{safe}(\Gamma)$
using steps in $\mathrm{steps}(\Gamma')$ and (controllability) when $(s_1, u_2, s_2) \in \mathrm{steps}(\Gamma)$ is
a step using an uncontrollable event $u_2$ from $\mathrm{eventsUC}(\Gamma)$ then $(s_1, u_2, s_2)$ is
also a step in $\mathrm{steps}(\Gamma')$.

Moreover, $\mathrm{pControl}(\Gamma) = \mathrm{resPrio}(\mathrm{control}(\mathrm{resPrio}(\Gamma, \mathrm{eventsUC}(\Gamma))), \mathrm{eventsC}(\Gamma))$
is the priority-aware controller for $\Gamma$.
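The fixed-point character of control can be sketched as repeated removal of violating steps. This is an illustrative reading of Definition 5, not the authors' implementation: the set-of-triples representation, the function name, and the small LTS are assumptions, unsafe states are assumed to have no outgoing steps, and the removal of unreachable states is omitted as in the definition.

```python
from typing import Set, Tuple

Step = Tuple[str, str, str]  # (source state, event, target state)


def control(steps: Set[Step], safe: Set[str], events_uc: Set[str]) -> Set[Step]:
    """Largest subset of steps whose targets can reach a safe state and at
    whose targets no uncontrollable plant step has been removed."""
    kept = set(steps)
    changed = True
    while changed:
        changed = False
        # States that can reach a safe state using the currently kept steps.
        can_reach = set(safe)
        grew = True
        while grew:
            grew = False
            for source, _event, target in kept:
                if target in can_reach and source not in can_reach:
                    can_reach.add(source)
                    grew = True
        for step in list(kept):
            target = step[2]
            blocking = target not in can_reach
            uncontrollable_lost = any(
                s == target and e in events_uc and (s, e, t) not in kept
                for (s, e, t) in steps
            )
            if blocking or uncontrollable_lost:
                kept.discard(step)
                changed = True
    return kept


# Hypothetical plant: s2 is safe, s3 and s4 are unsafe dead ends.
plant = {
    ("s0", "c2", "s4"), ("s0", "c1", "s1"), ("s0", "uc1", "s2"),
    ("s1", "uc3", "s3"), ("s1", "uc2", "s2"),
}
controller = control(plant, safe={"s2"}, events_uc={"uc1", "uc2", "uc3"})

print(sorted(controller))  # [('s0', 'uc1', 's2'), ('s1', 'uc2', 's2')]
```

In this example, the step into s4 and the step into s3 are removed for blocking, after which the step into s1 is removed because an uncontrollable step leaving s1 is no longer available. The remaining step leaving s1 belongs to a state that is now unreachable from s0.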


As an example for controller synthesis, consider the LTS in Figure 2 representing
an uncontrolled plant and the priority-aware controller synthesized for it.⁵ First,
to resolve blocking at $s_4$, the controllable priority 2 event $c_2$ from $s_0$ is prevented,
enabling the priority 1 event $c_1$ from $s_0$. Second, to resolve blocking at $s_3$, the
uncontrollable event $uc_3$ from $s_1$ is prevented. Third, to resolve non-controllability
at $s_1$, the controllable priority 1 event $c_1$ from $s_0$ is prevented, enabling the
priority 0 event $uc_1$ from $s_0$. The resulting controller will only contain the path
from $s_0$ to $s_2$ executing the event $uc_1$. Note that maintaining the steps of all
priorities in the LTS simplifies controller synthesis since the effect of preventing
controllable events (such as $c_2$ and $c_1$) becomes apparent immediately without



4 For brevity, we omit here the removal of unreachable states from $\Gamma'$.
5 When resolving priorities among uncontrollable events and later among controllable 
events no steps are removed in this example. 
Fig. 2. Example of controllability and non-blockingness. The unsafe states $\{s_3, s_4\}$ are
given in red with dotted border, the safe state $s_2$ is given in green with exiting arrow
symbol, the remaining orange states have paths to unsafe states, the start state $s_0$
has an entering arrow symbol, the bold steps execute the uncontrollable events $uc_i$, the
non-bold steps execute the controllable events $c_i$, the dashed steps have been prevented,
the event $c_2$ has priority 2, the event $c_1$ has priority 1, the other events have priority 0,
and only the boxed event $uc_1$ can be executed since the steps executing $\{c_1, c_2\}$ have
been prevented.


the need to derive such steps intermittently for the then enabled steps (e.g., only
the step executing $c_2$ was enabled initially due to its priority), decoupling LTS
generation and control synthesis.

Note that $\mathrm{control}(\mathrm{resPrio}(\Gamma)) \neq \mathrm{resPrio}(\mathrm{control}(\Gamma))$ in general because first
resolving the priorities restricts the possible controllers to be synthesized. For
example, first resolving priorities in Figure 2 would remove the step with the
event $uc_1$, which would otherwise be the only remaining step.


5 Control-oriented Graph Transformation 


We first introduce control-oriented GTSs before discussing the modeling of our 
running example using this formalism. 

To ease presentation, we employ the simple class of typed directed graphs
(short graphs). In our running example, we employ
the type graph TG, which can be understood to be a simple
UML class diagram, and graphs, which can be understood to be simple UML
object diagrams. In visualizations of graphs such as Figure 3b, types of nodes are
indicated by their names (i.e., $S_i$ and $T_i$ are nodes of type Shuttle and Track),
names of edges are omitted, and types of edges are only given when required to avoid
ambiguity (the only edge types with equal source and target node types are fast,
slow, and halt). We denote monomorphisms (monos) from graph $H$ to graph $H'$
mapping nodes and edges injectively by $f : H \hookrightarrow H'$.

To introduce control-oriented GTSs, we first introduce GT rules used to
derive GT steps between graphs. A Graph Transformation (GT) rule $p$ consists
of two monos $l : K \hookrightarrow L$ and $r : K \hookrightarrow R$ describing the removal and addition of
elements and a set $N$ of monos $n_i : L \hookrightarrow N_i$ of Negative Application Conditions
(NACs) describing forbidden extensions of $L$.⁶ We use the abbreviation $\mathrm{lhs}(p) = L$
later on. In visualizations of GT rules (see Figure 3), we use an integrated
notation in which $L$, $K$, and $R$ are given in a single graph where graph elements
marked with $\ominus$ are from $L - K$ and will be deleted, graph elements marked with
$\oplus$ are from $R - K$ and will be created, and where all other graph elements are in
$K$ and will be preserved. When NACs are present, they are given on the left side
of the > symbol. For example, consider the GT rule in Figure 3c, which preserves
the ambulance and shuttle nodes $A_1$ and $S_1$, removes the edge from $S_1$ to $A_1$,
creates an edge from $A_1$ to $S_1$, and is only applicable when $A_1$ has no edge to
some road node $R_1$.

We now introduce our novel notion of control-oriented GTSs. Such a GTS
$S$ contains a set $\mathrm{start}(S)$ of start graphs, a set $\mathrm{unsafe}(S)$ of unsafe graphs
representing violations of invariants, a set $\mathrm{rules}(S)$ of GT rules with the subsets of
controllable and uncontrollable GT rules $\mathrm{rulesC}(S)$ and $\mathrm{rulesUC}(S)$, and a
mapping $\mathrm{prio}(S)$ assigning a natural number as a priority to each GT rule. Note that,
similarly to our presentation of SCT in Section 4, we assign priorities to GT
rules and group them into controllable/uncontrollable GT rules capturing which
steps can/cannot be prevented by the controller to be synthesized.

GT steps $G \Rightarrow_{\sigma} G'$ from a graph $G$ to a graph $G'$ are labeled with a pair
$\sigma = (p, m)$ consisting of a GT rule $p$ and a match $m : \mathrm{lhs}(p) \hookrightarrow G$ identifying
an occurrence of $\mathrm{lhs}(p)$ in $G$. The match $m$ must satisfy the requirement that
there is no NAC $n_i : \mathrm{lhs}(p) \hookrightarrow N_i$ contained in $p$ for which some $m'_i : N_i \hookrightarrow G$
satisfying $m'_i \circ n_i = m$ exists. The graph $G'$ is then constructed from $G$ via the
usual Double Pushout (DPO) diagram (see [13], [14] for details).

A GTS induces a forward LTS by deriving GT steps from already included
graphs and adding these steps as well as their target states to the resulting LTS.
Note that we merely propagate the priorities of the GT rules into the constructed 
LTS instead of enforcing them by excluding lower-priority steps when higher- 
priority steps are present. 


Definition 6 (Forward LTS of a GTS, BFSS). A GTS $S$ induces the unique
LTS $\Gamma = [S]$ as follows:

• $\mathrm{states}(\Gamma)$ contains $\mathrm{start}(\Gamma)$ and the target states of all steps in $\mathrm{steps}(\Gamma)$.

• $\mathrm{start}(\Gamma)$ contains the graphs from $\mathrm{start}(S)$.

• $\mathrm{safe}(\Gamma) \subseteq \mathrm{states}(\Gamma)$ contains the graphs from which $\mathrm{unsafe}(\Gamma)$ can't be reached.

• $\mathrm{unsafe}(\Gamma) \subseteq \mathrm{states}(\Gamma)$ contains the graphs $G$ into which a mono $t : H \hookrightarrow G$
from some graph $H \in \mathrm{unsafe}(S)$ exists.

• $\mathrm{eventsC}(\Gamma)$ and $\mathrm{eventsUC}(\Gamma)$ contain the step labels $\sigma = (p, m)$ of the steps in
$\mathrm{steps}(\Gamma)$ where $p \in \mathrm{rulesC}(S)$ and $p \in \mathrm{rulesUC}(S)$, respectively.

• $\mathrm{prio}(\Gamma)(p, m) = \mathrm{prio}(S)(p)$ assigns the priority of the used GT rule $p$.

• $\mathrm{steps}(\Gamma)$ is the least relation containing all GT steps from states in $\mathrm{states}(\Gamma)$.

Moreover, the BFSS of depth $n$, denoted $[S]_n$, is the largest sub-LTS of $[S]$ in
which all paths starting in $\mathrm{start}(\Gamma)$ through distinct states have length $\leq n$.
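A bounded forward exploration in the spirit of Definition 6 can be sketched with a breadth-first search. This is a simplification under assumptions: graphs are replaced by opaque hashable states, GT rule application is abstracted into a hypothetical `successors` callback, and the bound is applied to the shortest distance from a start state, which only approximates the path-length condition of the definition.

```python
from collections import deque
from typing import Callable, Hashable, Iterable, List, Set, Tuple


def bfss(
    start_states: Iterable[Hashable],
    successors: Callable[[Hashable], List[Tuple[str, Hashable]]],
    depth: int,
) -> Tuple[Set[Hashable], Set[Tuple[Hashable, str, Hashable]]]:
    """Explore at most `depth` steps from the start states."""
    dist = {state: 0 for state in start_states}
    steps = set()
    queue = deque(dist)
    while queue:
        state = queue.popleft()
        if dist[state] == depth:
            continue  # leaf of the bounded state space, not expanded further
        for label, target in successors(state):
            steps.add((state, label, target))
            if target not in dist:
                dist[target] = dist[state] + 1
                queue.append(target)
    return set(dist), steps


# Toy successor function standing in for GT rule application.
def toy_successors(n: int) -> List[Tuple[str, int]]:
    return [("inc", n + 1), ("dbl", 2 * n)]


states, steps = bfss([1], toy_successors, depth=2)
```

For the toy system, the states 1, 2, 3, and 4 are explored, and the leaves 3 and 4 are the states that would be checked against the unsafe boundary of the BBSS at run-time.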


6 Our approach is orthogonal to the use of more expressive notions of application
conditions such as nested graph conditions [14].
(c) GT rule $p_{acp}$ for postponing ambulance creation.

(d) GT rule $p_{ace}$ for expected ambulance creation at the farthest road segment from the crossing.

(e) GT rule $p_{acu}$ for unexpected ambulance creation at some road segment (not on the crossing when there is a shuttle already).

(f) GT rule $p_a$ moving the ambulance to the next road.

(g) GT rules $p_{ff}$, $p_{sf}$, $p_{fs}$, $p_{ss}$, and $p_{hs}$ resulting in a fast or slow shuttle on the next track.

(h) GT rules $p_{sh}$ and $p_{hh}$ resulting in a halted shuttle on the same track.

GT rules                       controllable?  priority  $S_{Fe}$  $S_{Fu}$  Figure
$p_{acp}$                      no             0         yes       yes       Figure 3c
$p_{ace}$                      no             0         yes       no        Figure 3d
$p_{acu}$                      no             0         no        yes       Figure 3e
$p_a$                          no             0         yes       yes       Figure 3f
$p_{fs}$, $p_{ss}$, $p_{hs}$   yes            1         yes       yes       Figure 3g
$p_{ff}$, $p_{sf}$             yes            2         yes       yes       Figure 3g
$p_{sh}$, $p_{hh}$             yes            0         yes       yes       Figure 3h

(i) Overview of the GT rules used in the GTSs $S_{Fe}$ and $S_{Fu}$.

Fig. 3. Details on the running example.
We now discuss the modeling of our running example, which is a simplification
of the case study considered in our evaluation in Section 8. We model shuttles
driving on a track topology where subsequent tracks are connected using next
edges as in Figure 3b. The driving speed of each shuttle is either fast, slow, or halt
(as marked using fast, slow, or halt loops). Level crossings (where track and road 
topology intersect) are indicated by the node type Crossing and are connected to 
the corresponding track and road segments. Ambulances may appear and drive 
on the road topology including the level crossings. 


The graph in Figure 3b represents the current view of the shuttle on the
system state. The ambulance $A_1$ is not yet connected to a road meaning that it
can be ignored by the shuttle at this point. Ambulance and shuttle perform steps 
alternatingly by switching the directed edge between them in each step to ensure 
a certain level of fairness since the system would otherwise be fundamentally 
unsafe as the shuttle could not rule out collisions anymore. The edge from the 
ambulance to the shuttle indicates that the shuttle will perform the next step. 


Shuttles may maintain their speed (events ff, ss, and hh) or switch between 
fast and slow (events fs and sf) as well as between slow and halt (events sh and 
hs), modeling the stopping and acceleration distance. These seven driving speed 
transitions are controllable for the shuttle controller but all steps of ambulances 
are uncontrollable. To allow the shuttle to make timely control decisions, an 
ambulance detection mechanism informs the shuttle when ambulances are two
roads ahead of an upcoming level crossing (i.e., an ambulance would be detected
in Figure 3b when it enters the road $R_2$). We derive shuttle control assuming
that this detection mechanism is reliable but analysis will reveal partial robustness
against unreliability in situations where ambulances are detected first on
the closer road segments $R_1$ or even $R_0$. Note that shuttle and ambulance
performing steps alternatingly will result in violations of non-blockingness when the
controller prevents all controllable steps of the shuttle in a given state, which is
thereby implicitly excluded as well.


We use GT rule priorities to model that the shuttle prefers faster driving 
speeds over slower driving speeds. Therefore, without preventing any steps, the 
shuttle will maintain its fast speed. 


We now discuss the GT rules used in these GTSs in more detail. Again,
shuttle and ambulance steps alternate as implemented by switching the direction
of the edge between them in every GT rule. When it is the ambulance's turn, the
GT rules $p_{ace}$, $p_{acu}$, and $p_{acp}$ are applicable when the ambulance has no edge to
some road segment yet and the GT rule $p_a$ is used otherwise. The GT rule $p_{ace}$
models the expected creation of the ambulance by creating an edge from the
ambulance to the road $R_2$ in Figure 3b (the three NACs check that $A_1$ is not
yet on $R_1$, that $A_1$ is not yet on some other road, and that the matched road
$R_1$ has no predecessor). The GT rule $p_{acu}$ models the unexpected creation of the
ambulance by creating an edge from the ambulance to an arbitrary road unless
this road is at the level crossing with a shuttle being already located there as
well (the three NACs check that $A_1$ is not yet on $R_1$, that $A_1$ is not yet on some
other road, and that $S_1$ is not on a track connected by a crossing to $R_1$). The
GT rule $p_{acp}$ models the case that the ambulance is not yet created meaning that
ambulance detection is postponed (the NAC checks that the ambulance is not
yet on a road). Lastly, the GT rule $p_a$ models the moving of a detected ambulance
to the next road segment (by removing the edge from $A_1$ to the current road
segment $R_1$ and creating such an edge to the road segment $R_2$ reached). When
it is the shuttle's turn, the GT rules $p_{ff}$, $p_{fs}$, $p_{sf}$, $p_{ss}$, $p_{sh}$, $p_{hs}$, and $p_{hh}$ are used. The
GT rules $p_{sh}$ and $p_{hh}$ do not move the shuttle to the next track while the other
GT rules do so. Here, the movement of the shuttle is implemented as for the GT
rule $p_a$ by deleting and creating an edge and the driving speed transitions are
encoded by deleting and creating the driving speed loop at the shuttle.


In our running example, we first consider the GTS $S_{Fe}$ with expected
ambulance detection: for this GTS, we employ the graph from Figure 3b as start
graph, use 10 of the 11 GT rules from Figure 3, split GT rules into controllable
and uncontrollable GT rules, and employ priorities as listed in Figure 3i. In
particular, when it is the ambulance's turn, each enabled GT rule has the same
priority 0 making all steps derivable using the GT rules $p_{ace}$ and $p_{acp}$ viable.
When it is the shuttle's turn, GT rules setting the speed to halt, slow, and fast
have priorities 0, 1, and 2, favoring a faster driving speed. Also, the GT rules for
slowing down or remaining halted ($p_{fs}$, $p_{sh}$, and $p_{hh}$) cannot be prevented as this
would lead to a violation of non-blockingness as discussed. Additionally, we
consider a second GTS $S_{Fu}$ in which ambulances are possibly detected closer to or on
the level crossing: this GTS differs from $S_{Fe}$ by replacing the GT rule $p_{ace}$ with
$p_{acu}$ for detecting an ambulance, which may result in up to four steps detecting
the ambulance on any of the four road segments.


In the considered GTSs, only a finite number of graphs can be reached and,
in the remainder, we represent each graph using an element of $\{\mathrm{X}, 0, 1, 2, \checkmark\}
\times \{0, 1, 2, 3, 4, \checkmark\} \times \{f, s, h\} \times \{s, a\}$ where (a) X means that the ambulance has
not been detected yet, 0-2 is the distance of the ambulance to the crossing, and
$\checkmark$ means that the ambulance has advanced beyond the crossing, (b) 0-4 is the
distance of the shuttle to the crossing and $\checkmark$ means that the shuttle has advanced
beyond the crossing, (c) f, s, and h is the driving speed of the shuttle, and (d) s
or a means that the shuttle or the ambulance performs the next step. The start
graph from Figure 3b is therefore represented by X4fs as the ambulance has not
yet been detected, the shuttle is four tracks away from the level crossing, the
shuttle is in fast driving speed, and the shuttle will perform the next step.


The 6 unsafe graphs in $\{0\} \times \{0\} \times \{f, s, h\} \times \{s, a\}$ of the considered GTSs $S_{Fe}$
and $S_{Fu}$ all contain a shuttle and an ambulance on the level crossing but differ
in the three possible driving speeds of the shuttle and the two cases of which
entity performs the next step. While we specify the set of all unsafe states in our
GTS by providing it explicitly, unsafe states could also be identified using
advanced approaches such as nested graph conditions, Linear Temporal Logic,
Computation Tree Logic, or Metric Temporal Graph Logic [49].
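The four-component state encoding and the set of unsafe codes can be enumerated directly. In this small sketch, the ASCII letter "V" is an assumed stand-in for the symbol denoting "advanced beyond the crossing", and the variable names are hypothetical. The enumeration covers the whole encoding space, not only the graphs reachable in the GTSs.

```python
from itertools import product

AMBULANCE = ["X", "0", "1", "2", "V"]    # X: not detected, V: beyond the crossing
SHUTTLE = ["0", "1", "2", "3", "4", "V"]  # distance to the crossing, V: beyond it
SPEED = ["f", "s", "h"]                   # fast, slow, halt
TURN = ["s", "a"]                         # shuttle or ambulance moves next

all_codes = ["".join(parts) for parts in product(AMBULANCE, SHUTTLE, SPEED, TURN)]

# Unsafe: ambulance and shuttle are both on the level crossing (distance 0).
unsafe_codes = [code for code in all_codes if code[0] == "0" and code[1] == "0"]

print(len(all_codes), len(unsafe_codes))  # 180 6
```

The start graph X4fs is one of the 180 codes, and the six unsafe codes are exactly those starting with 00.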


The controller to be synthesized should force the shuttle to drive fast unless
an ambulance is present, in which case the controller should ensure that the
shuttle reaches the track $T_1$ with slow speed and then halts there until the
ambulance has passed the level crossing. Our integrated approach synthesizes
exactly this controller, as discussed subsequently.


6 Design-time Control-synthesis 


We now discuss design-time control synthesis based on (a) BBSS generation 
from unsafe states and (b) control synthesis based on SCT together resulting in 
an LTS with unsafe boundary to be avoided at run-time to avoid unsafe states 
and a safe boundary for which the LTS is a controller avoiding unsafe states. 

For our running example, we start the BBSS generation using only two unsafe
states $X_0 = \{00sa, 00fa\}$ for presentation purposes. We depict the obtained BBSS
in Figure 4, which is constructed by adding up to $k$ steps backwards from $X_0$.
From all additional states $X_1$, unsafe states in $X_0$ can be reached by construction;
to derive viable alternative steps avoiding unsafe states, we include all missing
forward steps from states in $X_1$ to additional states $X_2$. The states $X_2$ are by
construction safe states (indicated by the exiting arrow symbol) of the resulting
LTS from which unsafe states in $X_0$ cannot be reached (within $k$ steps). The
start states of the constructed backward LTS are the last states traversed on each
backward path (indicated by the entering arrow symbol). These start states will
be grouped into the safe and unsafe boundary in the next step.

We construct a controller from the BBSS given in Figure 4 by applying
SCT. First, the two unsafe states 00sa and 00fa violate non-blockingness. To
make these states unreachable, all five steps with one of them as a target are
prevented, resulting in a violation of non-blockingness at 01fs. To make this
state unreachable, the step (11fa, a, 01fs) is prevented, resulting in a violation
of controllability at 11fa. To make this state unreachable, all three steps with
11fa as target are prevented. Due to event priorities, only the boxed events can
actually be executed. Intuitively, the depicted controller ensures that, in the
presence of an ambulance approaching the upcoming level crossing, the shuttle
will avoid collisions, e.g., by halting in state 01ha. When the ambulance is created
unexpectedly closer to the crossing using $p_{acu}$ in $S_{Fu}$, the controller obtained
here will fail since it would enter track $T_1$ with fast speed when no ambulance
is detected, reaching state X1fa, and then not be able to halt in front of the level
crossing when the ambulance is then unexpectedly detected on the level crossing
in the next step, reaching state 01fs.

Technically, we construct the BBSS for a given GTS relying on a secondary
GTS called the backward GTS: We generate the BFSS for the backward GTS
(according to Definition 6), reverse the obtained LTS (according to Definition 1),
and then add the missing forward steps to safe states as explained above. For
our running example, we employ the backward GTSs $S_{Be}$ and $S_{Bu}$, which can be
obtained from their forward counterpart GTSs $S_{Fe}$ and $S_{Fu}$ by reversing their
GT rules (see, e.g., Lemma 3.14] for rule reversal based on the L operation) 
and switching the sets of unsafe and start graphs. The reason for using a back- 
ward GTS is a reduced size of the BBSS, since (not simply using rule reversal) 
modeling the backward GTS separately (while still ensuring that it agrees with 
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Fig. 4. Design-time controller synthesis based on BBSSs. We reuse the notation from 
for start states, unsafe states, safe states, potentially unsafe states, steps 
executing controllable/uncontrollable events, and prevented steps. The depicted BBSS 
of depth 3 and the resulting synthesized controller for the GTS Sege (or the GTS Seu) 
based for brevity on only two of the six unsafe states. The two unsafe states can be 
avoided resulting in an empty unsafe boundary. 


the forward GTS as discussed in the next section) as in the case study consid- 
ered in þection 8|allows to enforce known system invariants (such as a minimum 
distance between level crossings or upper bounds of shuttles in certain areas) to 
reduce the number of derived steps. 


Definition 7 (Backward LTS of a GTS, BBSS). A (backward) GTS S
induces the LTS $\Gamma' = [S]^{\mathrm{b}}$ by adapting $\Gamma = \mathrm{rev}([S])$ as follows:

• $\mathrm{states}(\Gamma')$ contains $\mathrm{states}(\Gamma)$ and the safe states $\mathrm{safe}(\Gamma')$.

• $\mathrm{safe}(\Gamma')$ contains the target graphs of all steps in $\mathrm{steps}(\Gamma') - \mathrm{steps}(\Gamma)$.

• $\mathrm{steps}(\Gamma')$ contains $\mathrm{steps}(\Gamma)$ and all GT steps from states in $\mathrm{states}(\Gamma)$.

Moreover, the BBSS of depth k, denoted $[S]^{\mathrm{b}}_k$, is the largest sub-LTS of $[S]^{\mathrm{b}}$
in which all paths through distinct states ending in $\mathrm{unsafe}(\Gamma')$ have length $\le k$.
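To make the bounded backward construction more concrete, here is a small Python sketch. It is a simplification rather than the construction of Definition 7: states are opaque identifiers instead of graphs, the bound is enforced by a layered breadth-first search over shortest backward distances, and the added forward steps to safe states are omitted. The function `bounded_backward` and the predecessor map `PRED` are hypothetical.

```python
from collections import deque


def bounded_backward(unsafe, predecessors, k):
    """Explore at most k backward steps from the unsafe states.

    predecessors(t) yields (source, event) pairs for steps source -e-> t.
    Returns (states, steps, start_states) of the reversed, bounded LTS.
    """
    depth = {s: 0 for s in unsafe}
    steps = set()
    queue = deque(unsafe)
    while queue:
        target = queue.popleft()
        if depth[target] == k:
            continue  # bound reached, do not look further back
        for source, event in predecessors(target):
            steps.add((source, event, target))
            if source not in depth:
                depth[source] = depth[target] + 1
                queue.append(source)
    # Start states: last states traversed, i.e. without an incoming step.
    with_incoming = {t for (_, _, t) in steps}
    start = {s for s in depth if s not in with_incoming}
    return set(depth), steps, start


# Hypothetical predecessor relation of a tiny system.
PRED = {"crash": [("b", "acu")], "b": [("a", "go")], "a": [("init", "move")]}
states_k2, steps_k2, start_k2 = bounded_backward(
    {"crash"}, lambda s: PRED.get(s, []), 2)
```

With bound 2 the exploration stops at `a`, which therefore becomes the only start state; raising the bound to 3 would move the start state to `init`.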


We now apply the procedure pControl to the BBSS to derive the design-time 
controller. The unsafe boundary for which no suitable control could be derived 
is then given by all start states without an outgoing step and the safe boundary 
is given by the remaining start states (for which a controllable path to a safe 
state could be established). 


Definition 8 (Design-time Controller). If S is a (backward) GTS and $k \in \mathbb{N}$,
then $\Gamma = \mathrm{pControl}([S]^{\mathrm{b}}_k)$ is the design-time controller with unsafe boundary
$\mathrm{uBoundary}(S, k) = \{ s \in \mathrm{start}(\Gamma) \mid \nexists (s, e, s') \in \mathrm{steps}(\Gamma) \}$.


The design-time controller for the BBSS in Figure 4 is constructed for k = 3 and
has an empty unsafe boundary. However, when using k = 2 (removing the states 
in the first row and the safe states in the second row), we obtain a design-time 
controller with safe boundary {11ha, 11sa} and unsafe boundary {11fa}. 
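The split of start states into the two boundaries can be illustrated with a few lines of Python. This is only a sketch: the function `boundaries` is a hypothetical helper, and the controller steps below are invented so that the outcome matches the k = 2 result stated above; they are not taken from Figure 4.

```python
def boundaries(start_states, controller_steps):
    """Split start states by whether the controller leaves them a step."""
    with_outgoing = {s for (s, _, _) in controller_steps}
    unsafe_boundary = {s for s in start_states if s not in with_outgoing}
    safe_boundary = set(start_states) - unsafe_boundary
    return safe_boundary, unsafe_boundary


# Hypothetical remaining steps after pruning; 11fa has none left.
START = {"11ha", "11sa", "11fa"}
CONTROLLER_STEPS = {("11ha", "e1", "safe1"), ("11sa", "e2", "safe2")}
safe_b, unsafe_b = boundaries(START, CONTROLLER_STEPS)
```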

As a further example, consider Figure 5, in which the uncontrollable event
acu is used by the GTS $S_{FU}$ for an unexpected shuttle detection, leading to a
non-empty unsafe boundary {X1fa}. In comparison, the controller obtained for
$S_{FE}$, which does not assume unreliable ambulance detection as in the step
(X1fa, acu, 01fs), is robust by also avoiding (according to Figure 4) the state
01fs preceding a collision in Figure 5. Moreover, this controller is robust
against ambulances appearing unexpectedly directly on the crossing using the
step (X2fa, acu, 02fs) unless the shuttle is already closer via step
(X1fa, acu, 01fs). Also, when an ambulance appears one track ahead of the
crossing, either no collision occurs (after step (X2fa, acu, 12fs)) or the
ambulance crashes into the shuttle (after step (X1fa, acu, 11fs)).

Fig. 5. Design-time controller synthesis with unexpected shuttle detection


7 Run-time Control-synthesis 


At run-time, we employ a given (forward) GTS $S_{FS}$ to derive the run-time
controller as follows. First, we adapt $S_{FS}$ into $S'_{FS}$ by using the current
state of the system as the unique start state and adding uBoundary($S_{BS}$, k) to
the set of unsafe states. Second, we construct the BFSS of depth n (which is
assumed to be maintained throughout system execution as described in Section 2)
for $S'_{FS}$. Third, we apply SCT to obtain the least-restrictive controller.


Definition 9 (Run-time Controller). If S is the GTS obtained from the forward
GTS as the adjustment to the current system state and the unsafe boundary
of the design-time controller and $n \in \mathbb{N}$, then $\Gamma = \mathrm{pControl}([S]_n)$ is the
run-time controller with leaf set
$\mathrm{leafs}(S, n) = \{ s \in \mathrm{states}(\Gamma) \mid \nexists (s, e, s') \in \mathrm{steps}(\Gamma) \}$.
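The bounded forward exploration and the leaf set can be sketched as follows. This is an illustrative Python sketch, not the paper's implementation: states are opaque identifiers, the pruning by pControl is left out (so the leaves are those of the BFSS itself), and `bounded_forward`, `leafs`, and the successor map `SUCC` are hypothetical.

```python
def bounded_forward(start, successors, n):
    """Breadth-first exploration of at most n steps from the current state.

    successors(s) yields (event, target) pairs for steps s -event-> target.
    """
    depth = {start: 0}
    steps = set()
    frontier = [start]
    for level in range(n):
        next_frontier = []
        for source in frontier:
            for event, target in successors(source):
                steps.add((source, event, target))
                if target not in depth:
                    depth[target] = level + 1
                    next_frontier.append(target)
        frontier = next_frontier
    return set(depth), steps


def leafs(states, steps):
    """States without an outgoing step in the explored LTS."""
    with_outgoing = {s for (s, _, _) in steps}
    return {s for s in states if s not in with_outgoing}


# Hypothetical chain of four states, explored with look-ahead n = 2.
SUCC = {"s0": [("e", "s1")], "s1": [("e", "s2")], "s2": [("e", "s3")]}
fw_states, fw_steps = bounded_forward("s0", lambda s: SUCC.get(s, []), 2)
fw_leafs = leafs(fw_states, fw_steps)
```

The leaf set is where the run-time exploration hands over to the design-time result: these are the states that are compared against the unsafe boundary.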


We now discuss in more detail how our run-time control synthesis obtains an
effective look-ahead of n + k steps towards unsafe states given by the n steps
of $\Gamma$ and the k steps of the design-time BBSS.⁷ To this end, we first define a
simulation relation to capture when a backward GTS such as $S_{BE}$ and $S_{BU}$ for
our running example is correct w.r.t. a forward GTS such as $S_{FE}$ and $S_{FU}$ for
our running example. Since we do not consider the step labels (containing the
GT rules or matches applied in these steps), we can understand this simulation
to be a weak simulation in which one step of the forward GTS is simulated
(backwards) by the backward GTS using any number of GT steps.

7 Our presentation also covers the special case where the backward GTS used at
design-time is obtained by reversing the rules of the run-time GTS but also applies
to backward GTSs that are designed for improved design-time efficiency and
applicability (as mentioned before Definition 7 in relation to k-induction and as
elaborated in Section 8).

Definition 10 (Simulation Relation for GTS-based LTSs). Given two
LTSs $\Gamma$ and $\Gamma'$ induced from GTSs according to Definition 6 and Definition 7,
a set R of morphisms $f_1 : G'_1 \to G_1$ from states $G'_1 \in \mathrm{states}(\Gamma')$ to states
$G_1 \in \mathrm{states}(\Gamma)$ is a simulation relation from $\Gamma$ to $\Gamma'$ if, for every
$(G_0, a, G_1) \in \mathrm{steps}(\Gamma)$ capturing the forward GT span
$(g_0 : D_1 \to G_0,\ g_1 : D_1 \to G_1)$, there is a sequence of GT steps
$(G'_0, \sigma_n, G'_{0,n-1}), \ldots, (G'_{0,1}, \sigma_1, G'_1) \in \mathrm{steps}(\Gamma')$ that can be combined
(using an iterated E-concurrent GT rule, [Theorem 8.26]) into the backward
GT span $(g'_0 : D'_1 \to G'_0,\ g'_1 : D'_1 \to G'_1)$ such that $d_1 : D'_1 \to D_1$ and
$f_0 : G'_0 \to G_0$ exist satisfying $f_0 \in R$, $f_0 \circ g'_0 = g_0 \circ d_1$, and
$f_1 \circ g'_1 = g_1 \circ d_1$.

The following theorem then states that the existence of such a simulation relation
R from the forward GTS to the backward GTS containing at least all embeddings
of unsafe states V into the graphs reachable in the forward GTS within k steps
is sufficient to ensure that any safety violation of the forward GTS within n to
n + k steps is detected by checking the states reachable by n steps in $[S_{FS}]_n$
against the start states of $[S_{BS}]^{\mathrm{b}}_k$. Note that Theorem 1 does not exclude
spurious violation paths in terms of path pairs $(\pi_1, \pi_2)$ that are not composable
to a path $\pi$ of $S_{FS}$ due to application conditions in GT rules used in $\pi_1$ or $\pi_2$.
Moreover, note that paths to unsafe states of length at most n steps are detected
by constructing $[S_{FS}]_n$ already.


Theorem 1 (Violation Detection). Given a forward GTS $S_{FS}$, a backward
GTS $S_{BS}$, and an unsafe graph V contained in unsafe($S_{FS}$) and unsafe($S_{BS}$),
every violation detected in $[S_{FS}]_{n+k}$ in terms of some path $\pi$ of length > n from
start($S_{FS}$) to a graph containing V is correspondingly detected by the combined
technique using $[S_{FS}]_n$ and $[S_{BS}]^{\mathrm{b}}_k$ by two paths: $\pi_1$ of length n from
start($S_{FS}$) to a graph containing B, and $\pi_2$ of length $\le k$ from some B' (for which
some $b : B' \to B$ exists) to the graph V, whenever there is a simulation relation R
from $[S_{FS}]_k$ to $[S_{BS}]^{\mathrm{b}}$ containing every mono $f : V \to G$ into states G of $[S_{FS}]_k$.


Proof (sketch). By induction on k, we derive the existence of an embedding of
the last graph B of $\pi_2$ into the last graph of $\pi_1$, ensuring that steps in $\pi$ reaching
a violating graph can be mimicked backwards via the simulation relation.


This theorem thereby ensures that the system has an effective look-ahead of 
n+k steps at run-time towards unsafe states allowing it to derive suitable control 
decisions to avoid such unsafe states (if possible for that effective look-ahead). 
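The combined check can be pictured as matching each leaf of the run-time BFSS against the graphs of the design-time unsafe boundary. The Python sketch below is a strong simplification made for illustration: graphs are encoded as frozensets of labelled tuples and the required embedding (a graph monomorphism in the paper) is approximated by set inclusion. The names `embeds` and `boundary_hits` and the example facts are hypothetical.

```python
def embeds(pattern, graph):
    """Stand-in for an embedding: every element of pattern occurs in graph.

    Both arguments are frozensets of tuples; real graph matching would have
    to search for an injective morphism instead of using set inclusion.
    """
    return pattern <= graph


def boundary_hits(leaf_states, unsafe_boundary):
    """Pairs (leaf, boundary graph) where the boundary graph embeds."""
    return [(leaf, b) for leaf in leaf_states
            for b in unsafe_boundary if embeds(b, leaf)]


# Hypothetical leaf reached after n run-time steps.
LEAF = frozenset({("shuttle", "t1"), ("speed", "fast"), ("crossing", "t2")})
# Hypothetical boundary graph from which a collision is k steps away.
BOUNDARY = frozenset({("shuttle", "t1"), ("speed", "fast")})
hits = boundary_hits([LEAF], [BOUNDARY])
```

A hit on a leaf at depth n signals a possible violation within n + k steps, which is the effective look-ahead discussed above; as noted for Theorem 1, such a hit may still be spurious.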


8 Evaluation 


As a case study, we now consider a more complex variation of the running
example, including additional track features such as junctions, explicit modeling
of monitoring and signals (traffic lights for shuttles and ambulances). The
GTSs modeling this case study ensure that the sliding window perspective of the
controlled shuttle is enforced by removing track and road segments behind the
shuttle and enlarging the track/road topology forwards, potentially also
including junctions, level crossings, and further components in a way to be
expected by the shuttle. While we simply used the reversed rules for the backward
GTSs in the running example, doing so for our case study would generate, as for
typical applications of the related approach of k-induction, a large number of
unrealistic track topologies that would need to be singled out using other
techniques such as structural constraints, reducing the applicability and
performance of our approach at design-time. Applying Theorem 1, we constructed a
backward GTS with 31 GT rules by hand such that all steps of the forward GTS with
34 GT rules can be mimicked by at most two backward steps while minimizing the
overapproximation of additional track topologies that are never reachable in the
forward GTS. We used the tool Groove and provide the documented model files and
an explanation of our evaluation steps online.⁸

[Plot: duration in ms over number of steps/effective look-ahead, with the series
"Forward to collision", "Forward to unsafe boundary", and "Backward from collision".]

Fig. 6. Evaluation results. Look-ahead for “forward to collision”, effective look-ahead
for “forward to unsafe boundary”, and depth of BBSS for “backward from collision”.

We evaluated the efficiency of our integrated approach in terms of consumed
time by comparing it to the case where only a BFSS is constructed at
run-time.⁹ First, we use Groove to construct BFSSs of the forward GTS (for different
bounds), thereby simulating the case where our approach is not used. Second, we
use Groove to construct BBSSs of the backward GTS (for different bounds), also
acquiring the unsafe boundary graphs, thereby simulating the design-time aspect
of our approach. Finally, we use Groove to construct the BFSS of the forward
GTS (for different bounds) using the unsafe boundary graphs as target graphs
(which means that the overhead of attempting to match the unsafe boundary
graphs is included in our measurement), thereby simulating the run-time aspect
of our approach. Generating the entire BFSS (for a given bound) instead of
only adjusting it to the last observed step means that we consider the worst-case
situation in which the entire BFSS is to be reconstructed due to, e.g., an
unexpected step of the system. According to Figure 6 (forward to collision),
the BFSS construction requires time that grows exponentially with the bound. In
particular, collisions are detected at depth 13 requiring 188 min, indicating that
only using a BFSS may incur unacceptable costs at run-time. According to Figure 6
(backward from collision), the BBSS grows much more slowly than the BFSS because
of (a) our usage of a separate backward GTS and (b) the restriction to paths
that definitely lead to unsafe states. Hence, increasing the bound k for this BBSS
is more advantageous than increasing the bound n for the BFSS in this
scenario. Lastly, according to Figure 6 (forward to unsafe boundary), the first
member of the unsafe boundary is found at run-time in the BFSS at depth 8
requiring 8 s with an effective look-ahead of 13 (as the depth 7 BBSS captures 5
forward steps of the forward GTS), which is 1423 times faster. We conclude from
our evaluation that the goal of shifting computation time (and memory costs)
from run-time to design-time is achieved by a factor of 1423 for the case study.

8 https://github.com/OpenAcademicProject/Running-Example-of-Railway-Transportation-System
9 System: 64-bit Win10, Intel Core i7-6700HQ, 40GB RAM, Groove 5.8.1

We note that applying our approach using a value k > 0 can increase the 
run-time cost. This would be the case when the forward/backward GTSs are 
constructed and the values of n and k are selected such that the time required 
for checking the leaf states of the run-time controller against the unsafe boundary 
of the design-time controller exceeds the time saved by generating at run-time 
a BFSS of depth n instead of n + k. This may be the case when, e.g., the BBSS 
contains a large number of infeasible paths (in the sense that the forward GTS 
cannot exhibit (instantiations of) them for the considered start states) resulting 
in an unsafe boundary containing a large number of states that can never be 
matched. While this issue did not arise for the case study considered here where 
run-time cost was decreased by a factor of 1423, this issue can be mitigated when 
it arises by employing assumed state invariants (capturing infeasibility of paths) 
to exclude states from the BBSS following the approaches in [1:7] [48] B]. 


9 Conclusion and Future Work 


In this paper, we presented a novel control-theoretic approach to run-time control 
for Graph Transformation Systems (GTSs) with priorities modeling large-scale 
systems with the threat of unexpected events. For the actor to be controlled, we
combine controllers synthesized at design-time and run-time with look-aheads 
n and k to obtain combined controllers with look-ahead n + k. An evaluation 
based on a shuttle transportation system shows a decrease of run-time compu- 
tation cost by a factor of 1423 compared to using only run-time controllers with 
the same look-ahead suggesting that our approach successfully shifts a large 
amount of run-time computation cost to design-time. Moreover, we exemplified 
the robustness of the devised controlled system against unexpected events. 

In the future, we will extend our approach to Interval Probabilistic Timed 
Graph Transformation Systems to model cyber-physical systems and the 
steps of the contained actors more precisely, incorporate techniques to minimize 
checking time against unsafe boundary nodes, and combine k-induction with 
hand-coded backward GTSs to obtain small Bounded Backward State Spaces 
(BBSSs) that are correct w.r.t. the forward GTS by design.


Abstract. Trusted execution environments (TEEs) have emerged as a
key technology in the cybersecurity domain. A TEE provides an isolated 
environment in which sensitive computations can be executed securely. 
Trusted applications running in TEEs are developed using standardized 
APIs that many hardware platforms for TEE adhere to. However, formal 
models tailored to standard TEE APIs are not well developed. In this 
paper, we present a formal specification of TEE APIs using Maude. We 
focus on Trusted Storage API and Cryptographic Operations API, which 
are foundational to mobile and IoT applications. The effectiveness of 
our approach is demonstrated through formal analysis of MQT-TZ, an 
open-source TEE application for IoT. Our formal analysis has revealed 
security vulnerabilities in the implementation of MQT-TZ, and we patch 
and confirm its integrity using model checking. 


Keywords: Trusted execution environments · formal specification ·
formal methods · model checking · rewriting logic · Maude


1 Introduction 


Trusted execution environments (TEEs) have emerged as a key technology in
the cybersecurity of a wide range of software [17]. They provide an isolated
program execution environment where sensitive computations can be executed
securely, shielding data from both software and hardware attacks. A TEE guarantees
the integrity, authenticity, and confidentiality of executed programs and their
data. TEEs are widely used in security-critical systems such as industrial control
systems [5,7], servers [10], mobile security [11], IoT [15], etc.

However, the effectiveness of TEEs depends on their proper implementation 
and use. Inaccuracies or vulnerabilities can compromise the very integrity they 
seek to maintain; for example, user applications can access an unauthorized 
region of memory [12], or a kernel can be compromised using a stack-overflow 
attack [2]. This emphasizes the importance of the formal verification of TEEs. 
Through rigorous examination and validation, we can ensure the robustness of 
TEEs, ensuring they operate as intended and providing an additional layer of 
confidence in their ability to protect critical data. 


© The Author(s) 2024

The standardization of TEE is overseen by Global Platform [8]. Many systems
that implement TEE, such as Samsung TEEgris, Trustonic Kinibi, Qualcomm 
QTEE, etc., adhere to this standard. The standard defines the API for trusted 
applications (TAs) to handle secure resources, such as memory and storage. 
These APIs are essential because they provide TEE services to applications 
running in a TEE. The uniformity of this API specification ensures compatibility 
across a wide range of applications, even when running on different CPUs. 

However, there is an evident deficiency in formal models tailored for TEE 
specification and its associated APIs. This gap is concerning because without 
rigorous verification and modeling, the integrity of TEEs could be compromised, 
potentially exposing vulnerabilities. In this paper, we address this concern by 
providing a comprehensive formal model of TEE APIs that is explicitly designed 
for the formal analysis of TEE applications. In this approach, we aim to provide 
a foundational tool that can serve the diverse spectrum of TEE applications and 
improve the overall security landscape of software. 


The architecture and behavior of Trusted Storage API, precisely defined in
the standard [8], are quite complicated. This complexity arises primarily from the
stringent security requirement that each TA is assigned dedicated storage, isolated
and shielded from other TAs. For example, the function responsible for creating a
file in a TEE involves multifaceted processes, which are briefly illustrated in
Section 3.1. Such intricacies amplify the difficulty of developing a faithful formal
model for TEEs, because of a huge representation gap between the informal
(standard) specification [8] and a formal model to be developed.

In this paper, we address the challenge of the representation gap by leveraging
a very expressive modeling language, called Maude [4], which supports powerful 
object-oriented specification. Since TEE API is mainly specified using objects 
and their interactions [8], it is appropriate to use such object-oriented modeling 
approaches to formally specify TEE APIs, making it much easier to develop 
a comprehensive formal model. We formalize important parts of TEE APIs, 
namely, Trusted Storage API and Cryptographic Operations API, which are 
central for trusted applications in mobile and IoT domains. 

We demonstrate the effectiveness of our approach for formally analyzing 
MQT-TZ [20,21], an open-source TEE application that secures the IoT protocol
MQTT. We have analyzed several security requirements of the implementation 
of MQT-TZ and found security vulnerabilities using model checking. We are able 
to fix a code-level bug and verify through model checking that the fixed program 
satisfies the previously violated requirements. 

This paper is organized as follows. Section 2 provides necessary background
on trusted execution environments and Maude. Section 3 presents the formal
object-oriented specification of Trusted Storage API in Maude. Section 4 presents
the Maude specification of Cryptographic Operations API. Section 5 explains
how TEE infrastructures, including trusted applications, can be specified in
Maude. Section 6 presents a case study on analyzing various requirements of
MQT-TZ and improving the implementation of MQT-TZ using our framework.
Section 7 discusses related work. Section 8 presents some concluding remarks.

Fig. 1: Overview of the TEE Architecture.


2 Preliminary 


Trusted Execution Environments. A trusted execution environment (TEE) uses a 
physically isolated storage and memory space to protect the security of program 
codes, executions, sensitive data, and so on. TEE is standardized by Global 
Platform [8], and many operating systems for TEE (e.g., Samsung TEEgris, 
Trustonic Kinibi, and Qualcomm QTEE) follow the standard. In particular, the 
standard defines the API for trusted applications to manage secure resources 
including memory and trusted storage. 

Figure 1 shows the overall architecture of TEE. Trusted applications (TAs)
are secure applications running in TEE. In contrast, rich applications (RAs) are
normal applications in the rich execution environment (REE). A trusted OS provides
a collection of API functions, specified in the standard document [8], for TAs to
perform secure operations. RAs perform secure services by invoking TAs, and the
results of such requests are returned to RAs through dedicated hardware called a
secure monitor.


Maude. Maude [4] is a language and tool for formally specifying and analyzing
concurrent systems. A Maude specification consists of: (i) an equational theory
(Σ, E) specifying system states as algebraic data types, where Σ is a signature
(i.e., declaring sorts, subsorts, and function symbols) and E is a set of equations;
and (ii) a set of rewrite rules R of the form l : t → t' if condition, specifying the
system behavior, where l is a label, and t and t' are terms [14].

In Maude, operators are declared with the syntax op f : s1 ... sn -> s,
where s1, ..., sn denote domain sorts and s denotes a range sort. Rewrite rules
are declared with the syntax crl [l]: t => t' if cond (or, for unconditional
rules, rl [l]: t => t'), where cond is a conjunction of equations. Similarly,
equations are declared with the syntax ceq t = t' if cond (or eq t = t').


A class declaration class C | att1 : s1, ..., attn : sn declares a class
C with attributes att1 to attn of sorts s1 to sn. An instance of a class C is
represented as a term < O : C | att1 : v1, ..., attn : vn > of sort Object,
where O is the object's identifier, and vi is the value of each attribute atti. A
subclass inherits the attributes and rewrite rules of its superclasses. A message
is represented as a term of sort Msg. A global system state is a term of sort
Configuration that has the structure of a multiset composed of objects and
messages, where multiset union is denoted by juxtaposition (empty syntax).

Maude provides a number of formal analysis methods, including LTL model
checking. Maude's LTL model checker checks whether each behavior from an
initial state satisfies a linear temporal logic (LTL) formula. A temporal logic
formula is constructed from state propositions and temporal logic operators such
as ~ (negation), /\, \/, [] (“always”), <> (“eventually”), and U (“until”).


K Framework. K [16] is a rewriting-based framework for defining the semantics of
programming languages, in which many languages, including C [6], Java [3], and
EVM [9], have been successfully formalized. In K, program states are specified
as multisets of cells, called K configurations. Each cell represents a component
of a program state, such as computations, environments, and stores. Transitions
between K configurations are defined by rewrite rules.

A computation in K is defined as a ~>-separated sequence of computational
tasks. For example, t1 ~> t2 ~> ... ~> tn represents the computation consisting of
t1 followed by t2 followed by t3, and so on. A task can be decomposed into simpler
tasks, and the result of a task is forwarded to the subsequent tasks. E.g., (5 + x) * 2
is decomposed into x ~> 5 + □ ~> □ * 2, where □ is a placeholder for the result
of a previous task. If x evaluates to some value, say 4, then 4 ~> 5 + □ ~> □ * 2
becomes 5 + 4 ~> □ * 2, which eventually becomes 18.
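The decomposition of (5 + x) * 2 can be replayed with a short Python sketch. It only mimics the idea of forwarding results into placeholders and is not how K itself is implemented; the list encoding of the task sequence, the marker `HOLE`, and the helpers `plug` and `run` are assumptions made for this illustration.

```python
HOLE = object()  # stands for the placeholder in a pending task


def plug(context, value):
    """Fill the hole of a pending task (op, left, right) with value."""
    op, left, right = context
    return (op,
            value if left is HOLE else left,
            value if right is HOLE else right)


def run(tasks, env):
    """Evaluate a task sequence t1 ~> t2 ~> ... from left to right."""
    tasks = list(tasks)
    while True:
        head = tasks[0]
        if isinstance(head, str):            # variable: replace by its value
            tasks[0] = env[head]
        elif isinstance(head, tuple):        # plugged operation: compute it
            op, a, b = head
            tasks[0] = a + b if op == "+" else a * b
        elif len(tasks) == 1:                # single value left: done
            return head
        else:                                # value: forward to the next task
            value = tasks.pop(0)
            tasks[0] = plug(tasks[0], value)


# x ~> 5 + HOLE ~> HOLE * 2 with x bound to 4
result = run(["x", ("+", 5, HOLE), ("*", HOLE, 2)], {"x": 4})
```

The run passes through the same intermediate stages as in the text: the value 4 is plugged into 5 + □, the sum 9 is plugged into □ * 2, and the final result is 18.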

The following shows a typical example of K rules for variable lookup, where
the k cell contains a computation, env contains a map from variables to locations,
and store contains a map from locations to values:

              x
   lookup : ⟨ ─ ... ⟩k ⟨... x ↦ l ...⟩env ⟨... l ↦ v ...⟩store
              v

A horizontal line represents a state change, and “...” indicates irrelevant parts.
A cell without horizontal lines is not changed by the rule. By the lookup rule, if
the first task in k is x, then x is replaced by the value v of x in its location l.

K rules can be translated into ordinary rewrite rules [16]. For example, the
lookup rule can be written in Maude as follows, where environments and stores
are declared as semicolon-separated multisets of assignments, and K, ENV,
and STORE are Maude variables that match the irrelevant parts:


rl [lookup]: k(X ~> K) env(X |-> L ; ENV) store(L |-> V ; STORE) 
=> k(V ~> K) env(X |-> L ; ENV) store(L |-> V ; STORE) . 
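For readers less familiar with rewriting, the effect of the lookup rule on a configuration can be mirrored in Python. This is only an analogy under assumed encodings (a dictionary with a task list for k and plain dictionaries for env and store); the function `lookup` is hypothetical and returns None when the rule does not match.

```python
def lookup(config):
    """Apply the lookup rewrite once, or return None if it does not match."""
    k, env, store = config["k"], config["env"], config["store"]
    # The rule needs a variable X at the head of k that is bound in env.
    if not k or not isinstance(k[0], str) or k[0] not in env:
        return None
    location = env[k[0]]
    # The location L of X must hold a value V in the store.
    if location not in store:
        return None
    # Only the head of k changes; env and store are left untouched.
    return {"k": [store[location]] + k[1:], "env": env, "store": store}


before = {"k": ["x", "rest"], "env": {"x": "l0"}, "store": {"l0": 4}}
after = lookup(before)
```

As in the Maude rule, the remainder of the computation, the environment, and the store play the role of the variables K, ENV, and STORE and are carried over unchanged.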


3 Formal Specification of Trusted Storage API 


Trusted Storage API manages files and cryptographic keys in trusted storage.
The architecture and behavior of Trusted Storage API [8] are summarized in
Section 3.1. Trusted Storage API is complex due to the security requirement
that each TA's storage is isolated and inaccessible to other TAs. We use Maude's
object-oriented specification to naturally specify the architecture as a collection
of objects (Section 3.2) and the behavior as rewrite rules (Section 3.3).

Fig. 2: The flow of TEE_CreatePersistentObject for the case of transformation.


3.1 Overview of Trusted Storage API 


In the TEE API standard [8], resources such as files and keys are expressed 
as objects in an abstract way. A cryptographic object contains attributes, which 
are data used to store key material in a structured way. A persistent object 
represents a file associated with a data stream in its storage, and may also be 
a cryptographic object with attributes. A transient object represents an object 
with attributes in memory, but no data streams. Object handles are references 
that identify a particular object and contain access rights information. 

There are a total of 26 functions in Trusted Storage API. The persistent 
API functions can create, rename, and delete persistent objects and their data 
streams. The data stream API functions can read, write, truncate, or seek data 
from persistent objects. The transient API functions can allocate and deallocate 
transient objects, set, reset, or copy cryptographic keys to the objects, or generate 
random keys. In addition, these functions can open object handles for persistent 
and transient objects, respectively. 

To illustrate the complexity of Trusted Storage API, consider the function 
TEE_CreatePersistentObject, which creates a persistent object and returns the 
object handle. It first checks if a persistent object with the same name exists. 
Then, depending on the overwrite access flag, the operation either fails, or the 
object is deleted and recreated. A new persistent object can be created either as a 
cryptographic object or as a pure data object (without attributes). In the former 
case, attributes can be taken from another cryptographic object, or a transient 
object can be transformed to the persistent object. We describe the execution
flow of transformation when a persistent object already exists in Figure 2. The
dashed box denotes deletion, and the dotted box represents creation.
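The case distinction described above can be summarized in a short Python sketch. It follows only the prose of this section and is not the standard's definition nor the Maude model: the function `create_persistent`, the dictionary encodings, and the result strings "ok" and "conflict" are placeholders chosen for illustration, and the sketch assumes that a transformed transient object ceases to exist as a transient object.

```python
def create_persistent(storage, transients, name, overwrite,
                      source=None, data=b""):
    """Sketch of the creation flow.

    storage:    file name -> {"data": ..., "attributes": ...}
    transients: transient object id -> attribute dictionary
    source:     None, a transient object id, or a persistent file name
    """
    # An existing object may only be replaced if overwriting is allowed.
    if name in storage and not overwrite:
        return "conflict"
    if source is None:
        attributes = None                       # pure data object
    elif source in transients:
        attributes = transients.pop(source)     # transformation of a transient
    else:
        # Attributes taken from another cryptographic (persistent) object.
        attributes = dict(storage[source]["attributes"] or {})
    storage.pop(name, None)                     # deletion of the old object
    storage[name] = {"data": data, "attributes": attributes}  # creation
    return "ok"


storage = {"file1": {"data": b"old", "attributes": None}}
transients = {"to": {"type": "rsaKeyPair"}}
first = create_persistent(storage, transients, "file1", False, "to")
second = create_persistent(storage, transients, "file1", True, "to", b"new")
```

The first call is rejected because the file exists and overwriting is not requested; the second call deletes and recreates the object from the transient object, which corresponds to the transformation case of Figure 2.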


3.2 Representing Trusted Storage Objects in Maude 


Trusted Storage API can naturally be formalized in an object-oriented style. A 
cryptographic object is modeled as an instance of the class CryptoObj, where 
the attributes type, max-size, and usages denote the type, maximum size, and 
usages of a cryptographic key to be created, respectively; and attributes denotes 
cryptographic attributes. 


class CryptoObj | type : Type, max-size : Nat, usages : Set{Usage},
                  attributes : Set{CryptoAttribute} .

A persistent object is modeled as an instance of the class PersistObj, where
the attribute file-name denotes the name of its file, and data-stream denotes the
associated data stream. Similarly, a transient object is modeled as an instance of 
the class TransObj, where initialized indicates whether the object is initialized. 
Both classes are declared as subclasses of CryptoObj, because they are both 
cryptographic objects according to the standard [8]. 


class PersistObj | file-name : FileName, data-stream : List{Data} .
class TransObj | initialized : Bool .
subclass TransObj PersistObj < CryptoObj .


A handle is represented as an instance of a subclass of the class Handle, where 
oid is the object that it points to. In particular, an object handle is represented 
as instances of the subclass ObjHandle, where flags contains data access flags. 


class Handle | oid : Oid .
class ObjHandle | flags : Set{DataAccessFlag} .
subclass ObjHandle < Handle .


The storage of each TA is modeled as an instance of the class Storage, 
where status denotes its status, files denotes the file names in the storage, 
and counter denotes a counter for creating a new identifier. 


class Storage | status : StorageStatus, files : Set{FileName}, counter : Nat . 


The kernel of each TA is modeled as an instance of the class TAKernel, where 
status denotes its status, storage denotes its storage, counter denotes a counter 
for creating a new identifier, and api-call denotes the status of an API call. 
The status of a TA can be normal, outOfMemory, or panic. 


class TAKernel | status : AppStatus, storage : Oid, 
counter : Nat, api-call : CallStatus . 


We represent an API function call as f(vl) # n of sort CallStatus, where
f is a function identifier, vl is the list of call parameters, and the optional n
denotes the step of the call. The return of the call is represented as
return(f, rl), where rl denotes the return values. We use return(f) if there are
no return values.

The interactions between the objects are represented as the messages of the 
form msg r[vl] from Sender to Receiver, where r is the name of a request 
and vl is a list of arguments for the request. We use msg r from Sender to 
Receiver for the request with no arguments. For example, msg getStatus from 
TK to SI represents a request message from the TA kernel TK to its associated 
storage SI for returning the status with no arguments. 

The following example shows a TA and its associated storage, a transient
object and its object handle, and a persistent object named file1.


< tk : TAKernel | status : normal, id-counter : 1, storage : so, ... >
< oh : ObjHandle | oid : to, flags : empty >
< so : Storage | status : normal, files : fileName('file1), counter : 1 >
< to : TransObj | type : rsaKeyPair, max-size : 15, usages : decrypt >
< po : PersistObj | file-name : fileName('file1), type : rsaKeyPair, ... >


3.3 Specifying Trusted Storage API Behaviors


Specification of TEE_ReadObjectData. This function takes a single parameter, 
a handle to a persistent object for data reading. A TA first checks the storage 
status by sending a message getStatus to an associated storage. When the 
storage receives getStatus, it returns its status using a message retStatus. 


rl [read-object-data-get-storage-status]:
< TK : TAKernel | api-call : readObjData(HI), storage : SI >
=> < TK : TAKernel | api-call : readObjData(HI) # 1 > (msg getStatus from TK to SI) .


rl [return-storage-status]: 
< SI : Storage | status : STATUS > (msg getStatus from TK to SI) 
=> < SI : Storage | > (msg retStatus[STATUS] from SI to TK) . 


If the storage status is normal, the TA sends a message read to the handle 
to request data reading. Otherwise, it returns the storage status. 


rl [read-object-data-storage-status-check]: 
(msg retStatus[STATUS] from SI to TK) 
< TK : TAKernel | api-call : readObjData(HI) # 1 > 
=> if STATUS == normal then 
< TK : TAKernel | api-call : readObjData(HI) # 2 > (msg read from TK to HI) 
else < TK : TAKernel | api-call : return(readObjData, STATUS) > fi .


When the handle receives read and has the flag accessRead, it reads the first 
data from the data stream of the persistent object. The data is returned to the 
TA using a message retData and the TA returns the received data. 


rl [read-object-data-from-persist]: 
< HI : ObjHandle | oid : PI, flags : (accessRead, FLAGS) > 
< PI : PersistObj | data-stream : DATA :: STREAM > (msg read from TK to HI) 
=> < PI : PersistObj | data-stream : STREAM > (msg retData[DATA] from HI to TK) 
< HI : ObjHandle | > .


rl [read-object-data-success]: 
(msg retData[DATA] from HI to TK) 
< TK : TAKernel | api-call : readObjData(HI) # 2 > 
=> < TK : TAKernel | api-call : return(readObjData, DATA) > . 
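
The five rules above can be read as one sequential protocol. The following Python sketch is only an illustration of that reading, not part of the Maude specification: the class and function names are hypothetical, message passing is collapsed into ordinary function calls, and the non-normal storage status "unavailable" used in the usage note is an invented placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class PersistObj:
    data_stream: list  # the head of the list is read first (DATA :: STREAM)


@dataclass
class ObjHandle:
    obj: PersistObj
    flags: set = field(default_factory=set)


def read_object_data(storage_status, handle):
    """Sequential reading of the TEE_ReadObjectData rules (hypothetical helper)."""
    # getStatus / retStatus: the TA kernel asks its storage for the status.
    if storage_status != "normal":
        # else-branch of read-object-data-storage-status-check
        return ("return", "readObjData", storage_status)
    # read-object-data-from-persist only matches with accessRead and a
    # non-empty data stream; otherwise no rule applies in this sketch.
    if "accessRead" not in handle.flags or not handle.obj.data_stream:
        return None
    data = handle.obj.data_stream.pop(0)  # consume the first element
    # read-object-data-success: retData becomes the API return value.
    return ("return", "readObjData", data)
```

For instance, a handle with flag accessRead on a stream ["d1", "d2"] yields ("return", "readObjData", "d1") and leaves ["d2"] in the stream, while a storage status such as "unavailable" is returned to the caller unchanged.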


Specification of TEE_CreatePersistentObject. Due to the page limit, we explain
the rules used to specify the behavior in Figure 2. This function takes five
parameters: file name, access flags, a handle to another transient or persistent
object, initial data, and an optional handle. A TA determines the method for
creating a persistent object and sends a creation request to an associated storage.


rl [create-persistent-determine-case]: 
< TK : TAKernel | api-call : createPersistent(FILE, FLAGS, HI, DATA, OPT), 
storage : SI > 
=> < TK : TAKernel | api-call : createPersistent(FILE, FLAGS, HI, DATA, OPT) # 1 >
mkCreationMsg(FILE, FLAGS, HI, DATA, OPT, SI, TK) . 


108 Geunyeol Yu, Seunghyun Chae, Kyungmin Bae, and Sungkun Moon 


The mkCreationMsg function determines the creation method and constructs
a create message, where the first argument denotes the method id. If the handle
is null, the message is for creating a pure persistent object. If both the handle
and the optional handle are not null, the message is for creating a persistent
object. Otherwise, it is for transforming a transient object into a new
persistent object.


op mkCreationMsg : FileName Set{DataAccessFlag} HandleId Data HandleId 
Oid Oid -> Configuration . 
eq mkCreationMsg(FILE, FLAGS, null, DATA, OPT, SI, TK) 
= (msg create[pure FILE FLAGS null DATA] from TK to SI) . 


ceq mkCreationMsg(FILE, FLAGS, HI, DATA, OPT, SI, TK) 
= if OPT == null 
then (msg create[transform FILE FLAGS HI DATA] from TK to SI) 
else (msg create[persist FILE FLAGS HI DATA] from TK to SI) fi if HI =/= null . 
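
The case analysis performed by mkCreationMsg can be summarized in a few lines. The Python sketch below is a hypothetical transcription of the two equations above: None stands for null, and the message is modeled as a plain tuple (method, payload, sender, receiver), which is an assumption made for illustration only.

```python
def mk_creation_msg(file, flags, hi, data, opt, si, tk):
    """Pick the creation method as in the equations for mkCreationMsg."""
    if hi is None:
        method = "pure"        # no source handle: pure persistent object
    elif opt is None:
        method = "transform"   # handle given, optional handle null
    else:
        method = "persist"     # both handles given
    # The optional handle only selects the method; it is not part of the payload.
    return (method, (file, flags, hi, data), tk, si)
```

As in the Maude equations, the first case ignores the optional handle entirely, so a null handle always selects the pure method.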


When the storage receives the create message, it checks whether a
persistent object with the same name already exists in the storage. If the object
exists and the access flags contain the overwrite flag, it proceeds by sending the
create message to the persistent object. Otherwise, it informs the TA with createFail.


crl [create-persist-overwrite-check]: 
(msg create[METHOD FILE FLAGS HI DATA] from TK to SI) 
< PI : PersistObj | file-name : FILE > 
< SI : Storage | status : normal, files : FILES, counter : N > 
=> < PI : PersistObj | > 
if overwrite in FLAGS 
then < SI : Storage | counter : N + 2 >
(msg create[METHOD FILE FLAGS HI DATA N TK] from SI to PI) 
else (msg createFail from SI to TK) < SI : Storage | > fi if FILE in FILES . 


When the persistent object receives the create message with the transform 
method, it transforms the transient object into a persistent object, opens a new 
object handle, and deletes itself. Then, the handle is sent to the TA through the 
message createSuccess. The function newOid is used to create a fresh identifier. 


crl [create-persist-transform]: 
(msg create[transform FILE FLAGS HI DATA N TK] from SI to PI)
< HI : ObjHandle | oid : OI > 
< OI : TransObj | type : TYPE, usages : USAGES, max-size : M, 
attributes : ATTRS > 
< PI : PersistObj | file-name : FILE > 
=> < NEW-HI : ObjHandle | oid : NEW-PI, flags : FLAGS > 
< NEW-PI : PersistObj | type : TYPE, usages : USAGES, max-size : M, 
attributes : ATTRS, data-stream : DATA, 
file-name : FILE > 
(msg createSuccess[NEW-HI] from NEW-PI to TK) 
if NEW-HI := newOid(N, SI) /\ NEW-PI := newOid(N + 1, SI) . 


When the TA receives a createSuccess message with an object handle, it
returns the handle. If it receives createFail or detects insufficient memory, it
returns the corresponding error.


rl [create-persist-success]: (msg createSuccess[HI] from PI to TK)
< TK : TAKernel | status : normal, api-call : createPersistent(VL) > 
=> < TK : TAKernel | api-call : return(createPersistent, HI) > . 


rl [create-persist-fail]: (msg createFail from SI to TK) 
< TK : TAKernel | status : normal, api-call : createPersistent(VL) > 
=> < TK : TAKernel | api-call : return(createPersistent, errorAccessConflict) > . 


rl [create-persist-mem-err]: 
< TK : TAKernel | status : outOfMemory, api-call : createPersistent(VL) >
=> < TK : TAKernel | api-call : return(createPersistent, errorOutOfMemory) > . 


4 Formal Specification of Cryptographic Operations API 


The Cryptographic Operations API handles cryptographic algorithms by managing
operation states. It is also quite complex because of these internal operation
states. This section shows that these difficulties can be effectively dealt with
using Maude’s object-oriented specification.


4.1 Overview of Cryptographic Operations API 


A cryptographic operation abstracts a cryptographic process. It has an operation 
state such as initial, active, or extract. An operation handle is a reference to a 
cryptographic operation. Each handle has a handle state, which is defined by 
whether a key is set, an operation is initialized, and data can be extracted. 

The API provides a total of 30 functions for various types of cryptographic
primitives and schemes, including symmetric ciphers, authenticated encryptions,
and key derivations. In addition, the generic operation API functions support
the operations common to all types. These functions can allocate, free, and reset
cryptographic operations, and set cryptographic keys.

To illustrate the complexity of the Cryptographic Operations API, consider the
state diagram of symmetric ciphers, described in Figure 3. The operation can be
started either with or without a key (KEY_SET or not KEY_SET). If it has no key,
TEE_SetOperationKey is used to set a key. Otherwise, it is initialized (INIT) by
TEE_CipherInit. The operation can run the algorithm with TEE_CipherUpdate.
After performing the operation, TEE_FreeOperation can be used to deallocate
the operation, or TEE_CipherDoFinal can be used to finish and reset the operation.
Figure 4 shows the state diagram of message digest, which is also complex.
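
As a rough illustration of the symmetric cipher life cycle described above, the following Python sketch encodes the transitions as a lookup table. It is based only on the prose and not on the figure itself, so several details are assumptions: the state names NO_KEY and FREED are invented, TEE_CipherDoFinal is assumed to return to the key-set state, and TEE_FreeOperation is assumed to be allowed in every live state.

```python
# Transition table read off the prose description (assumptions noted above).
TRANSITIONS = {
    ("NO_KEY", "TEE_SetOperationKey"): "KEY_SET",
    ("KEY_SET", "TEE_CipherInit"): "INIT",
    ("INIT", "TEE_CipherUpdate"): "INIT",
    ("INIT", "TEE_CipherDoFinal"): "KEY_SET",  # assumed reset target
}


def step(state, call):
    """Apply one API call, or raise ValueError if it is not allowed."""
    if call == "TEE_FreeOperation" and state != "FREED":
        return "FREED"  # assumed: deallocation is possible in any live state
    try:
        return TRANSITIONS[(state, call)]
    except KeyError:
        raise ValueError(f"{call} is not allowed in state {state}") from None


def run(calls, state="NO_KEY"):
    """Fold a sequence of API calls over the state machine."""
    for call in calls:
        state = step(state, call)
    return state
```

Calling TEE_CipherUpdate before a key is set and the cipher is initialized is rejected by this table, which is the kind of ordering constraint the formal model has to track for every operation class.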


4.2 Representing Cryptographic Operations in Maude 


Cryptographic operations can naturally be modeled in an object-oriented style. 
We model cryptographic operations as instances of class CryptoOp. The attribute 
attributes denotes a set of CryptoAttribute, max-size is the maximum size 
of a key to use, and algorithm is the identifier of an algorithm to operate. The 
attributes mode, state, and opclass denote the mode, state, and class of the 
operation, respectively, and acc-data is a list of Data it holds. 

Fig. 3: Symmetric cipher operation. Fig. 4: Message digest operation.

class CryptoOp | attributes : Set{CryptoAttribute}, max-size : Nat,
algorithm : Algorithm, mode : Mode, state : State, 
opclass : OpClass, acc-data : List{Data} . 


Operation handles are represented as instances of the class OpHandle, which 
extends Handle. The attribute state is a handle state and key-material-set 
denotes whether cryptographic key materials are set to the operation. 


class OpHandle | state : HandleState, key-material-set : Bool .
subclass OpHandle < Handle .


Specification of TEE_AllocateOperation. This function takes three parameters: 
an algorithm identifier, a mode, and the maximum key size. A TA first checks 
whether the algorithm and mode are compatible using the compatible function. 
If valid, it creates a new cryptographic operation, and opens and returns an 
operation handle. The function getClass is used to retrieve the algorithm class. 


crl [allocate-operation-success]: 
< TK : TAKernel | api-call : allocOperation(ALGO, MODE, MAXSIZE), 
status : normal, id-counter : N > 
=> < TK : TAKernel | api-call : return(allocOperation, HI), id-counter : N + 2 > 
< HI : OpHandle | oid : OI, state : noKeyNotInit, key-material-set : false > 
< OI : CryptoOp | attributes : empty, max-size : MAXSIZE, handle : HI, 
algorithm : ALGO, mode : MODE, opclass : getClass(ALGO), 
acc-data : nil, state : initial > 
if compatible(ALGO, MODE) /\ OI := newOid(N, TK) /\ HI := newOid(N + 1, TK) . 


If the algorithm and mode are not compatible or insufficient memory is
detected, the TA returns the corresponding error, as specified by the following rules:


crl [allocate-operation-params-err]:
< TK : TAKernel | api-call : allocOperation(ALGO, MODE, MAXSIZE) >
=> < TK : TAKernel | api-call : return(allocOperation, errorNotSupported) > 
if not compatible(ALGO, MODE) . 


rl [allocate-operation-memory-err]: 
< TK : TAKernel | status : outOfMemory, api-call : allocOperation(VL) > 
=> < TK : TAKernel | api-call : return(allocOperation, errorOutOfMemory) > . 
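
Since rewrite rules are applied nondeterministically, it can help to list which of the three allocation rules are enabled in a given situation. The Python sketch below does this for the rule conditions as they are written above; the function name and the string labels are hypothetical, and the handle creation itself is abstracted to the label "handle".

```python
def alloc_outcomes(status, is_compatible):
    """Return the set of results that the three allocation rules permit."""
    outcomes = set()
    # allocate-operation-success: requires status normal and compatibility.
    if status == "normal" and is_compatible:
        outcomes.add("handle")
    # allocate-operation-params-err: the rule has no condition on the status.
    if not is_compatible:
        outcomes.add("errorNotSupported")
    # allocate-operation-memory-err: only the status is inspected.
    if status == "outOfMemory":
        outcomes.add("errorOutOfMemory")
    return outcomes
```

Under this reading, an incompatible request issued while the TA is out of memory may return either error, because both error rules match the same configuration.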


Specification of TEE_ResetOperation. A TA creates a resetOp message to reset
a cryptographic operation. When the cryptographic operation receives the request
and its key materials are set, it resets the operation state using the resetState
function, clears the data, and notifies the TA using a message finishResetOp.
The function resetState updates the state to initial if the state is active.


rl [reset-operation-request-reset]: 
< TK : TAKernel | api-call : resetOperation(HI) > < HI : OpHandle | oid : CI > 
=> < TK : TAKernel | > < HI : OpHandle | > (msg resetOp[HI] from TK to CI) . 


rl [reset-operation-finish-reset]: 
< CI : CryptoOp | state : STATE > (msg resetOp[HI] from TK to CI) 
< HI : OpHandle | oid : CI, key-material-set : true > 

=> < CI : CryptoOp | acc-data : nil, state : resetState(STATE) > 
< HI : OpHandle | > (msg finishResetOp from CI to TK) . 


rl [reset-operation-success]: (msg finishResetOp from CI to TK) 
< TK : TAKernel | api-call : resetOperation(VL) > 
=> < TK : TAKernel | api-call : return(resetOperation) > . 
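
The effect of the reset rules on a single operation can be sketched as follows. This is a hypothetical Python rendering: the operation is modeled as a dictionary, and resetState is assumed to leave every state other than active unchanged, which the text does not state explicitly.

```python
def reset_state(state):
    """resetState: active becomes initial; other states assumed unchanged."""
    return "initial" if state == "active" else state


def reset_operation(op, key_material_set):
    """Apply reset-operation-finish-reset to an operation given as a dict."""
    if not key_material_set:
        return None  # the rule requires key-material-set : true
    # Clear the accumulated data and reset the operation state.
    return {**op, "acc_data": [], "state": reset_state(op["state"])}
```

Resetting an operation that is still in its initial state only clears acc_data, which is already empty right after allocation.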


Specification of TEE_CipherUpdate. This function takes two parameters: an
operation handle and input data. A TA creates a message reqCipher to request
data encryption or decryption. When a cryptographic operation receives the
message and key materials are set, it checks whether the operation can succeed
using the cipherSuccess function. If successful, the operation runs the algorithm
with runOp and returns a result to the TA using the finishCipher message.
Otherwise, it reports failure using the failCipher message.


rl [cipher-update-request-cipher]: < HI : OpHandle | oid : CI > 
< TK : TAKernel | api-call : cipherUpdate(HI, DATA) > 
=> < TK : TAKernel | > < HI : OpHandle | > (msg reqCipher[HI DATA] from TK to CI) . 


rl [cipher-update-try-cipher]: 
(msg reqCipher[HI DATA] from TK to CI) 
< HI : OpHandle | key-material-set : true > 
< CI : CryptoOp | attributes : ATTRS, algorithm : ALGO, mode : MODE, 
opclass : CLASS, state : STATE > 
=> < CI : CryptoOp | > < HI : OpHandle | > 
if cipherSuccess(ALGO, MODE, ATTRS, CLASS, STATE, DATA) then 
(msg finishCipher[runOp(ALGO, MODE, ATTRS, DATA)] from CI to TK) 
else (msg failCipher from CI to TK) fi . 


When the TA receives the encrypted or decrypted data through finishCipher,
it returns the data. If it receives failCipher, it goes to panic.


rl [cipher-update-success]: (msg finishCipher[VALUE] from CI to TK)
< TK : TAKernel | api-call : cipherUpdate(VL) > 
=> < TK : TAKernel | api-call : return(cipherUpdate, VALUE) > . 


rl [cipher-update-panic]: 
< TK : TAKernel | api-call : cipherUpdate(VL) > (msg failCipher from CI to TK) 
=> < TK : TAKernel | status : panic > . 


5 Formal Specification of a TEE Infrastructure


5.1 Representing Rich and Trusted Applications in Maude 


Thanks to the K semantics, we can model RAs and TAs that run programs written
in any programming language. Applications are represented as instances of the
following class App, where prog denotes a program and proc is a K configuration
for the program execution. RAs and TAs are modeled as instances of the classes
RA and TA, respectively. Both classes inherit from App, but TA also inherits
from TAKernel.


class App | prog : Program, proc : KConfig . 


class RA . class TA . 
subclass RA < App . subclass TA < App TAKernel . 


In this paper, we define K rewrite rules for a subset of the C language,
including function calls, variables, assignments, loops, and conditional statements.
As mentioned in Section 2, the K semantics can be written in Maude.

TEE API function calls are handled by TAKernel. When a TEE API function FUNC
is called with parameters VL, a TA pushes the call to api-call and adds a task
$wait(FUNC), representing the task waiting for the function FUNC. Then,
the TAKernel handles the call as explained in Sections 3 and 4. The isTeeApi
function is used to check whether a function is a TEE API function.


crl [tee-api-call]: 

< TI : TA | proc : (k(FUNC(VL) ~> K) KS) > 
=> < TI : TA | proc : (k($wait(FUNC) ~> K) KS), api-call : FUNC(VL) > 
if isTeeApi(FUNC) . 


After the TAKernel handles the call, the TA assigns the return values to the
function’s output variables. We use $out(xl) to denote output variables xl. The
makeRetStmt function is used to create statements for assigning variables.


crl [tee-api-call-return]: 
< TI : TA | proc : (k($wait(FUNC) ~> $out(XL) ~> K) KS), 
api-call : return(FUNC, VL) > 
=> < TI : TA | proc : (k(STMT ~> K) KS), api-call : noCall > 
if isTeeApi(FUNC) /\ STMT := makeRetStmt(VL, XL) . 


5.2 Representing Execution Environments 


We represent the two separated execution environments as a pair {S_R} | {S_T},
where S_R contains RAs and S_T contains TAs, together with the objects and
messages introduced in Sections 3 and 4. The trusted OS is represented as an
instance of the class TrustedOS, where sess is a map from SessionId to Oid.
Sessions are communication channels between an RA and a TA.


class TrustedOS | sess : Map{SessionId,Oid} .


We specify the communications between an RA and a TA using Maude rules.
The RA calls the TA using a secure monitor call (SMC). We define its semantics
using the following rule. A message smcReq represents an SMC and the function
makeSmcArgs makes SMC arguments.


crl [invoke-ta]:
< RI : RA | proc : (k(FUNC(VL) ~> K) KS) >
=> < RI : RA | proc : (k($wait(FUNC) ~> K) KS) > smcReq(ARGS) 
if isInvokeFunc(FUNC) /\ ARGS := makeSmcArgs(RI, FUNC, VL) . 


The secure monitor accepts the SMC request by transferring the message 
smcReq from REE to TEE. Later, it gets a result from TEE through a message 
smcRet and finishes the request by transferring the message to REE. 


rl [accept-smc-request]: {REE smcReq(ARGS)} | {TEE} => {REE} | {TEE smcReq(ARGS)} .
rl [return-smc-request]: {REE} | {TEE smcRet(ARGS)} => {REE smcRet(ARGS)} | {TEE} . 


We define the behavior of a trusted OS when receiving smcReq. The OS 
invokes a target TA using an invkTa message. The function getTargetTa is used 
to extract the target TA from SMC arguments and getRequestor is used to get 
the RA’s identifier. 


crl [accept-smc-request]:
< OS : TrustedOS | sess : SM > smcReq(ARGS) 
=> < OS : TrustedOS | > invkTa(TI, RI, ARGS) 
if RI := getRequestor(ARGS) /\ TI := getTargetTa(ARGS, SM) . 


When the target TA receives invkTa and is not running, it executes a program
using the function run. For example, run(p, f, vl) executes the function f of a
program p with arguments vl. The functions getFunc and getParams are used
to get a function identifier and call parameters from SMC arguments.


crl [handle-invoke-ta]: 

< TI : TA | proc : none, prog : P > invkTa(TI, RI, ARGS) 
=> < TI : TA | proc : run(P, F, VL) > invkTa(TI, RI, ARGS) 
if F := getFunc(ARGS) /\ VL := getParams(ARGS) . 


After the execution, the TA gets a result from proc using the function getRes 
and creates an invkTaRet message. Then, the trusted OS creates an smcRet 
message for sending the result to the secure monitor, which is transferred to 
REE. The function finished checks whether the process is finished. 


crl [handle-invoke-ta-finish]: 
< TI : TA | proc : KS > invkTa(TI, RI, ARGS) 
=> < TI : TA | proc : none > invkTaRet(RI, RV) 
if finished(KS) /\ RV := getRes(KS) /\ RI := getRequestor(ARGS) . 


crl [return-smc-request]:
< OS : TrustedOS | > invkTaRet(RI, RES) => < OS : TrustedOS | > smcRet(ARGS)
if ARGS := makeSmcArgs(RI, RES) . 


When the RA receives the message smcRet with the result, it finishes the
secure monitor call using the function makeRetStmt. The function retVal is used
to get return values from smcRet.


crl [invoke-ta-finish]:
< RI : RA | proc : (k($wait(F) ~> $out(XL) ~> K) KS) > smcRet(ARGS) 
=> < RI : RA | proc : (k(STMT ~> K) KS) > 
if RI == getRequestor(ARGS) /\ VL := retVal(ARGS) /\ STMT := makeRetStmt(VL, XL) . 
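
The rules of this section describe a round trip from the RA through the secure monitor and the trusted OS to the TA and back. The Python sketch below replays that round trip as a linear trace. It is a simplification under stated assumptions: the function and parameter names are hypothetical, the session identifier is passed explicitly, a TA program is modeled as a dictionary of Python callables, and the concurrency of the real model is ignored.

```python
def invoke_ta(ra_id, session_id, func, params, sessions, programs):
    """Replay one SMC round trip and return (result, trace)."""
    trace = [f"{ra_id}: smcReq"]               # invoke-ta
    trace.append("monitor: REE -> TEE")        # accept-smc-request (monitor)
    ta_id = sessions[session_id]               # getTargetTa via the sess map
    trace.append(f"OS: invkTa({ta_id})")       # accept-smc-request (trusted OS)
    result = programs[ta_id][func](*params)    # handle-invoke-ta: run(P, F, VL)
    trace.append(f"{ta_id}: invkTaRet")        # handle-invoke-ta-finish
    trace.append("OS: smcRet")                 # return-smc-request (trusted OS)
    trace.append("monitor: TEE -> REE")        # return-smc-request (monitor)
    trace.append(f"{ra_id}: resume")           # invoke-ta-finish
    return result, trace
```

Each trace entry corresponds to one rule application, so a complete invocation takes seven rewrite steps outside the TA program itself in this simplified view.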


6 A Case Study on Formal Analysis of MQT-TZ


This section shows the effectiveness and feasibility of our formal model using
MQT-TZ [21], a TEE-based implementation of the message transport protocol.
We defined LTL properties for MQT-TZ (Section 6.1), formally analyzed them
with threat models, and proposed a patch (Sections 6.2 and 6.3). Our formal
specification, case study model, and experimental results are available in [25].


6.1 Overview of MQT-TZ 


MQT-TZ [2] is a secure topic-based publish-subscribe protocol utilizing TEE.
Figure 5 illustrates the overall architecture, presenting three entities: publisher,
subscriber, and broker. Publishers collect, encrypt, and send data as messages
to a broker’s topic. A subscriber can receive these messages by subscribing to
a topic. Brokers manage topics, subscriptions, and message delivery from
publishers to subscribers. Each broker is implemented using TEE, consisting of a
single RA and TA. The RA retrieves publisher messages, calls the TA for
re-encryption, and forwards the re-encrypted messages to subscribers.

Re-encryption is a key mechanism for protecting messages from potential
threats. It ensures that messages cannot be exploited, allowing only the intended
subscribers to read them. This is accomplished as follows: (i) clients (publishers
and subscribers) generate symmetric keys and securely share them with brokers
using TLS, (ii) the publishers encrypt messages with their keys, and (iii) the
brokers decrypt the messages using the publishers’ keys and re-encrypt them
with the subscribers’ keys in TEE.

To analyze MQT-TZ, we define various requirements and express them as
LTL properties. These properties are summarized in Table 1. The properties P1
to P5 represent requirements for correctness of message reception (P1, P2, and
P3), system integrity (P4), and robustness of message sending (P5). P6 is for
checking whether the MQT-TZ scenarios satisfy the basic invariant.


[Figure: a broker split into REE (RA) and TEE (TA, TransObj, PersistObj with its data stream).]

Fig. 5: Overview of MQT-TZ.

Table 1: The LTL properties for MQT-TZ.

Prop.  Description                                           LTL Formula

P1     If no memory error occurs in the broker,              $\Box\neg\mathit{memErr.B} \rightarrow (\mathit{send.P} \rightarrow \Diamond\,\mathit{recv.S})$
       subscribers eventually receive messages.

P2     If the TA panics, subscribers should not receive      $\Box(\mathit{panic.TA} \rightarrow \Box\neg\mathit{recv.S})$
       any messages.

P3     If any memory error occurs in the broker,             $\Box(\mathit{memErr.B} \rightarrow \Box\neg\mathit{recv.S})$
       subscribers should not receive any messages.

P4     When the TA starts running, it should                 $\Box(\mathit{start.TA} \rightarrow \Diamond\,\mathit{term.TA})$
       eventually terminate.

P5     If subscribers receive messages from publishers,      $\Box(\mathit{inQueue.P}(a :: b :: c) \rightarrow \Diamond\,\mathit{inQueue.S}(a :: b :: c))$
       messages sent from each publisher are in order.

P6     The number of tasks handled by the TA cannot          $\Box(\neg\mathit{numTaskExceed}(5))$
       exceed five.


For formal analysis, we represent MQT-TZ’s entities (brokers, publishers,
and subscribers) as Maude objects. We model brokers as instances of the Broker
class, which is a nested object with the execution environments of Section 5 for
running RA and TA, along with a buffer for storing publisher messages and a
subscriber list. Publishers are modeled as instances of the Publisher class, which
has a list of collected data to be sent to brokers. Subscribers are represented as
instances of Subscriber, which has a list of received messages from brokers.

We specify the behavior of clients and brokers, depicted in Figure 5. For
publishers, we define their behavior with two rules: collecting data, and sending
it to brokers with encryption. The behavior of subscribers is represented by
a single rule for message reception. We specify the behavior of a broker RA
using the following rules: (1) capturing publisher messages and storing them
in a message buffer, (2) running the MQT-TZ RA program, which calls a TA
(explained in Section 5), and (3) receiving re-encrypted messages from the TA
and sending them to subscribers.

For a broker RA and TA, we obtained their C programs from the MQT-TZ
GitHub repository. To run them in our model, we translated a total of 1200 lines
of C code to our C-subset language using a simple translation script. Figure 6
shows the TA’s re-encryption function before the conversion.


6.2 LTL Model Checking 


We have performed LTL model checking for the properties in Table 1, considering
two threat models. We use the following scenario for the analysis:

- Two subscribers (sub1, sub2), two publishers (pub1, pub2), and one broker
  participate, where the broker has two topics.
- sub1 subscribes to a single topic, while sub2 subscribes to all topics.
- pub1 sends a single message, while pub2 sends two.

static TEE_Result
payload_reencryption(void *session,
                     uint32_t param_types,
                     TEE_Param params[4]){
  TEE_Result res;
  uint32_t exp_param_types =
    TEE_PARAM_TYPES(
      TEE_PARAM_TYPE_MEMREF_INPUT,
      TEE_PARAM_TYPE_MEMREF_INOUT,
      TEE_PARAM_TYPE_MEMREF_INOUT,
      TEE_PARAM_TYPE_VALUE_INPUT);
  if (param_types != exp_param_types)
    return TEE_ERROR_BAD_PARAMETERS;
  ...
  if (alloc_resources(session,
                      TA_AES_MODE_DECODE)
      != TEE_SUCCESS){
    res = TEE_ERROR_GENERIC;
    goto exit;
  }
  if (set_aes_key(session, ori_cli_key)
      != TEE_SUCCESS){
    res = TEE_ERROR_GENERIC;
    TEE_Free((void *) ori_cli_key);
    goto exit;
  }
  if (cipher_buffer(session,
        (char *) params[0].memref.buffer
        + TA_MQTTZ_CLI_ID_SZ + TA_AES_IV_SIZE,
        data_size, dec_data, &dec_data_size)
      != TEE_SUCCESS){
    res = TEE_ERROR_GENERIC;
    goto exit;
  }
  ...
  TEE_Free((void *) dec_data);
exit:
  return res;
}


Fig. 6: The C code of the TA’s re-encryption function.


Threat models. We consider two threat models: an out-of-memory threat and
a message modification threat. The out-of-memory threat nondeterministically
changes the status of a TA to outOfMemory. The message modification threat
represents a compromised broker [21] that calls a TA with incorrect arguments.
We specify the threats using Maude. We model the out-of-memory threat as a
single rewrite rule as follows.


rl [out-of-memory-threat]: < TK : TAKernel | status : normal >
=> < TK : TAKernel | status : outOfMemory > .


For the message modification threat, we model an intruder as an instance
of the Intruder class with a single attribute subs-list, denoting a broker’s
subscription list. Prior to the attack, the intruder learns the subscription list
of a target broker from the messages in the broker’s REE and records this in
subs-list. After learning, the intruder uses this information and modifies any
incoming messages of the broker by replacing the sender with any one of its
subscribers. We can model this attack behavior as follows. The modify function
replaces the SENDER of a publisher message mqttzMsg with one of the subscribers
in the learned subscription list SUBS-LIST.


rl [message-modification-threat]: (mqttzMsg [DATA|TOPIC] from SENDER) 
< INT : Intruder | subs-list : SUBS-LIST > 
=> < INT : Intruder | > modify(DATA, TOPIC, SENDER, SUBS-LIST) . 


Model checking experiment. We consider the following threat scenarios: without
any threats (NON), with the message modification threat (MSG), and with the
out-of-memory threat (OOM). We measure the size of the state space (|S|) in
thousands, the model checking result (Safe?), and the time in seconds. The symbols
⊤ and ⊥ denote that the property is safe and violated, respectively. We use the
Maude model checking command for the analysis, which provides counterexamples
for violations. We ran the experiment on an Intel Xeon 2.8GHz with 256 GB memory.

Table 2: The results of LTL model checking.

Prop.  Type  Safe?  |S|   Time
P1     NON   ⊤      62    35.7
P1     MSG   ⊤      148   90.1
P1     OOM   ⊤      202   144.2
P2     NON   ⊤      62    34.9
P2     MSG   ⊥      17    91
P2     OOM   ⊤      532   547.9
P3     NON   ⊤      62    35
P3     MSG   ⊤      148   88.8
P3     OOM   ⊥      01    0.1
P4     NON   ⊤      62    34.9
P4     MSG   ⊤      148   88.6
P4     OOM   ⊤      532   539.3
P5     NON   ⊤      62    33.8
P5     MSG   ⊤      148   86.9
P5     OOM   ⊤      532   546.7
P6     NON   ⊤      62    34.3
P6     MSG   ⊤      148   87.9
P6     OOM   ⊤      532   542.4

As summarized in Table 2, the two properties P2 and P3 are violated under the
threats, indicating possible vulnerabilities. By analyzing the counterexample
of the P2 violation, we have discovered that the TA can panic during the message
re-encryption. This occurs because the sender of a message can be modified,
leading the TA to decrypt the message with an incorrect sender’s key. For the
P3 violation, we have found that when insufficient memory is detected, the TA
finalizes the re-encryption with an error and returns a re-encrypted message
containing (dummy) data. In this case, the RA does not verify whether the TA
returns a correct re-encrypted message and continues to transmit the message
to subscribers, so the subscribers obtain a message containing dummy data.


6.3 Patching the MQT-TZ Vulnerabilities 


To fix the identified vulnerabilities, we have implemented code-level patches
for both the MQT-TZ RA and TA, as illustrated in Figure 7. Newly added
patches are highlighted in red, while the original code is depicted in black.
The left side shows the patch for the RA, and the right side is for the TA. For
the TA, we modify it to inform the RA of a memory error or panic. In the case of
the RA, modifications are made to ignore the re-encrypted message when a memory
error or panic notification is received. Additionally, we have implemented the
discardMsg function to handle the cleanup of the re-encrypted message.


TEEC_Result                                     static TEE_Result
void main(struct test_ctx *ctx,                 payload_reencryption(void *session,
    mqttz_client *origin, mqttz_client dest,        uint32_t param_types,
    mqttz_times *times) { ...                       TEE_Param params[4]){
  res = TEEC_InvokeCommand(&ctx->sess,            ...
                           TA_REENCRYPT,          if (alloc_resources(session,
                           &op, &ori);                  TA_AES_MODE_DECODE)
  if (res == TEE_ERROR_OUT_OF_MEMORY ||               != TEE_SUCCESS){
      res == TEE_ERROR_TA_DEAD) {                   res = TEE_ERROR_OUT_OF_MEMORY;
    discardMsg(ctx, origin, dest);                  goto exit;
  }                                               }
  ... }                                           ... }


Fig. 7: The patch codes for the MQT-TZ RA (left) and TA (right).

Table 3: The results of LTL model checking after applying the patches.

Prop.  Type  Safe?  |S|   Time
P1     NON   ⊤      62    35.3
P1     MSG   ⊤      149   89.9
P1     OOM   ⊤      203   146.2
P2     NON   ⊤      62    35.1
P2     MSG   ⊤      149   89.9
P2     OOM   ⊤      347   294.8
P3     NON   ⊤      62    34.8
P3     MSG   ⊤      149   89.7
P3     OOM   ⊤      347   285.2
P4     NON   ⊤      62    34.7
P4     MSG   ⊤      149   89.4
P4     OOM   ⊤      347   278.5
P5     NON   ⊤      62    34.1
P5     MSG   ⊤      149   87.4
P5     OOM   ⊤      347   288.6
P6     NON   ⊤      62    34.4
P6     MSG   ⊤      149   87.9
P6     OOM   ⊤      347   286.1

To validate the patches, we have performed the LTL model checking from
the previous section again. As shown in Table 3, P2 and P3 become safe (marked
in red), while all other results remain the same. In addition, we observe that the
state space is reduced by up to approximately 185 thousand states compared to the
original experiment. This is because the patches discarded the states related to
memory error or panic.
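
The reported reduction can be recomputed from the OOM rows of Tables 2 and 3. The short Python sketch below does so with the |S| values (in thousands) copied from the two tables. P3 is left out because its pre-patch run stopped at a counterexample, so its state count is not comparable; the variable names are ours.

```python
# |S| in thousands for the OOM scenario, copied from Tables 2 and 3.
before = {"P1": 202, "P2": 532, "P4": 532, "P5": 532, "P6": 532}
after = {"P1": 203, "P2": 347, "P4": 347, "P5": 347, "P6": 347}

# Positive values mean that the patched model has fewer states.
reduction = {prop: before[prop] - after[prop] for prop in before}
largest = max(reduction.values())
```

With these figures, the largest reduction is 532 - 347 = 185 thousand states, matching the number quoted above, while the state space for P1 is essentially unchanged.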

In addition, we have identified redundant function calls in the TA program using
formal analysis. For example, TEE_ResetOperation is called right after allocating
a cryptographic operation. Since the operation has not started, it remains in its
initial state and thus the reset operation has no effect. These redundancies can
be safely removed. To show this, we have collected all final states of the program
with and without redundancies and compared them. We confirmed that the reachable
states of the two programs (with and without redundancies) are the same.


7 Related Work 


Many studies have investigated the formal analysis of protocols leveraging TEE.
The work [13] introduces a protocol for Wasm applications, and verifies the
correctness of its authentication, such as aliveness and non-injective agreement.
Another work [22] presents a protocol for secure remote credential management
using TEE, which is verified against the Dolev-Yao model. Both papers have
proven the correctness of their protocols by model checking. On the other hand,
the paper [24] formally analyzes direct anonymous attestation schemes running
on secure hardware through theorem proving. The papers [18,19] employ a similar
approach, but aim at verifying remote attestation services of TEEs provided
by Intel. However, unlike our work, they focus on specific protocols and do not
propose a formal analysis framework for general TEE-based applications.

A formal analysis technique for an IoT framework using TEE is presented
in [23]. It provides a hierarchical colored Petri net for Trusted IoT
Architecture (TIoTA), which aims to protect data in IoT networks. This approach has
been used to verify security properties in CTL by model checking. However, it
is specifically tailored to TIoTA and cannot be applied to general TEE-based
applications. In contrast, our work aims to provide a formal analysis framework
for general TEE-based applications, written in any programming language whose
operational semantics is specified in K.


8 Concluding Remarks 


We have presented a formal specification for TEE APIs using Maude. We have 
specified two important TEE APIs (Trusted Storage API and Cryptographic 
Operations API) that are fundamental to mobile and IoT applications. We have 
leveraged Maude’s object-oriented specification to reduce a representation gap 
between the standard document and the formal model, allowing us to effectively 
specify the complex architectures and behaviors of the TEE APIs. 

The effectiveness and feasibility of our approach have been demonstrated
through formal analysis of MQT-TZ [21,20], an open-source TEE application
for IoT. We have analyzed security requirements of MQT-TZ under given threat
models. Our formal analysis has revealed security vulnerabilities in the MQT-TZ
implementation. We have patched a code-level bug and verified the previously
violated requirements.

Future work includes providing comprehensive formal specifications for
TEE APIs, covering the time API, TEE arithmetical API, and peripheral and
event APIs. Additionally, we plan to verify the TEE API itself or generate test
cases for real-world validation using our formal specification. Another important
direction involves developing state space reduction techniques to enhance the
efficiency of TEE application analysis.
Abstract. Blockchains are decentralized systems that provide trustable
execution guarantees through the use of programs called smart contracts.
Smart contracts are programs written in domain-specific programming
languages running on blockchains that govern how tokens and cryptocurrency
are sent and received. Smart contracts can invoke other smart contracts
during the execution of transactions initiated by external users.
Once deployed, the code of running smart contracts cannot be modified, so
techniques like runtime verification are very appealing for improving their
reliability. Moreover, the conventional model of computation of smart
contracts is transactional: once operations commit, their effects are
permanent and cannot be undone. Therefore, errors in smart contracts may
lead to losses of millions.

In this paper, we present the concept of future monitors, which allows
monitors to remain waiting for future transactions to occur before committing
or aborting. This is inspired by optimistic rollups, which are
modern blockchain implementations that increase efficiency (and reduce
cost) by delaying transaction effects. We exploit this delay to propose
a model of computation that allows bounded future monitors. We show
that our monitors are correct with respect to legacy transactions, how they
implement bounded future monitors, and how they guarantee progress. We
illustrate the use of bounded future monitors by correctly implementing
multi-transaction flash loans.


1 Introduction 


Blockchains [20] were first introduced as distributed infrastructures that
eliminate the need for trusted third parties in electronic payment systems.
Modern blockchains incorporate smart contracts [27,28] (contracts hereon), which
are stateful programs stored in the blockchain that govern the functionality of
blockchain transactions. Users interact with blockchains by invoking contracts³
whose execution controls the exchange of cryptocurrency. Contracts allow
sophisticated functionality, enabling many applications in decentralized finance
(DeFi), decentralized governance, Web3, etc.


* This work was funded in part by PRODIGY Project (TED2021-132464B- 
100)—funded by MCIN/AEI/10.13039/501100011033/ and the European Union 
NextGenerationEU/PRTR—by DECO Project (PID2022-138072OB-100)—funded 
by MCIN/AEI/10.13039/501100011033 and by the ESF+—and by a research grant 
from Nomadic Labs and the Tezos Foundation. 

3 Non-contract addresses can be considered as unit contracts. 
Contracts are written in high-level programming languages, like Solidity [2]
and Ligo [4], which are then typically compiled into low-level bytecode languages
like EVM [28] or Michelson [1]. Even though contracts are typically small
compared to conventional software, writing contracts is notoriously difficult. The
open nature of the invocation system, where every contract can invoke every
other contract, makes it easy for malicious users to break programmers' assumptions
and steal user tokens (e.g. [23]). Once installed, contract code is immutable⁴ and
the effect of running a contract cannot be reverted (the contract is the law).

Two classic reliability approaches can be applied to contracts: 


— static techniques, ranging from static analysis [26] and model checking [22]
to deductive software verification techniques [3,21,8,14], theorem proving
assistants or assisted formal construction of programs [25].

— dynamic verification [15,6,18,10], dynamically inspecting the execution of
contracts against specifications and taking corrective measures.


In this paper, we follow a dynamic monitoring technique. Monitors are a defensive
mechanism to express desired properties that must hold during the execution
of the contracts. If the property fails, the monitor fails the whole transaction.
Otherwise, the execution finishes normally according to the contract code. In
practice, monitors are mixed within the contract code, which limits the properties
that can be monitored. In [10], the authors presented a hierarchy of monitors,
including operation and transaction monitors. An operation monitor for a
contract A runs alongside A and reads and modifies specific monitor variables 
stored in A [5,6,18]. Operation monitors can only execute when A is invoked
and cannot inspect or invoke other contracts. Transaction monitors [10] can in- 
spect information across a full transaction, even after the last invocation of A 
in the transaction. For example, the return of a loan within the transaction is 
an important property that can be monitored with a transaction monitor and 
not by an operation monitor, because a transaction must fail if the money lent 
is not returned by the end of the transaction. 

Traditional blockchain systems cannot implement transaction monitors [10], 
but fortunately, this is easy to achieve by extending the execution model with 
two simple features: a first instruction and a Fail/NoFail hookup mechanism. 
Instruction first returns true during the first invocation of the contract in 
the current transaction. The Fail/NoFail mechanism equips each contract with 
a new flag, fail, that can be assigned (to true or false) during the execution 
of the contract (and that is false by default). The semantics of fail is that 
transactions fail if at least one contract has its fail flag set to true at the end 
of the transaction. 
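
The two features can be illustrated with a minimal Python sketch. It is not the execution model of any real blockchain: the class Lender, its monitor variable lent, and the helper run_transaction are hypothetical names introduced here, and the sketch assumes that a failed transaction simply restores the previous contract state.

```python
import copy
from dataclasses import dataclass


@dataclass
class Lender:
    balance: int = 100
    lent: int = 0           # monitor variable: tokens lent in this transaction
    fail: bool = False      # Fail/NoFail flag, false by default
    seen: bool = False      # bookkeeping behind the `first` instruction

    def first(self) -> bool:
        """True only during the first invocation in the current transaction."""
        is_first, self.seen = not self.seen, True
        return is_first

    def lend(self, amount: int) -> None:
        if self.first():
            self.lent = 0               # reset the monitor variable
        self.balance -= amount
        self.lent += amount
        self.fail = self.lent > 0       # outstanding loan: transaction must fail

    def repay(self, amount: int) -> None:
        if self.first():
            self.lent = 0
        self.balance += amount
        self.lent -= amount
        self.fail = self.lent > 0


def run_transaction(contracts: dict, operations: list) -> str:
    """Run operations in order; commit only if no contract ends with fail set."""
    snapshot = copy.deepcopy(contracts)
    for c in contracts.values():        # per-transaction state starts fresh
        c.seen, c.fail = False, False
    for address, method, arg in operations:
        getattr(contracts[address], method)(arg)
    if any(c.fail for c in contracts.values()):
        contracts.clear()
        contracts.update(snapshot)      # a failed transaction has no effect
        return "fail"
    return "commit"
```

A transaction that lends and repays within the same sequence of operations commits, while a transaction that only lends ends with the fail flag set and is rolled back.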

In this paper, we study an even richer notion of monitors that enables transactions
to fail or commit depending on future transactions. Future monitors can predicate on
sequences of transactions during a bounded period of time. This period of time,
called the monitoring window, is fixed a priori.


4 Although there are techniques to upgrade the behaviour of smart contracts, like
proxy patterns and diamond proxy [19], the actual code does not change. 
Optimistic rollups. Future monitors can be implemented easily in Layer-2
Optimistic Rollups⁵, which are an approach to improve blockchain scalability
by moving computation and data off-chain. The most popular optimistic
rollup implementation is Arbitrum [9], implemented on top of the Ethereum
blockchain [28]. Arbitrum offers the same API as Ethereum, allowing users to install
and invoke Ethereum contracts. Arbitrum transactions are executed off-chain
and their effects are submitted as assertions. Assertions are optimistically
assumed to be correct and a fraud-proof arbitration scheme allows invalid
assertions to be detected. Assertions are pending during a challenge period⁶ to allow
observers to check their correctness. The arbitration game consists of a bisection protocol,
played between the challenger and asserter, which has the property that the honest
player can always win the dispute. Assertions that survive until the end of
the challenge period become permanent. Future monitors can exploit the delay
imposed by the challenge period to fail or commit based on information from
the future.


Bounded Future Monitoring. In this article, we enrich transaction monitors 
with a controlled ability to predicate about the future evolution of blockchains. 
Contracts are extended to include: txid, failmap, and timeout. The instruc- 
tion txid returns the (unique) current transaction identifier. Each contract 
is equipped with a map failmap indicating—for each transaction involving the 
contract—whether the future monitor of the transaction is activated or not, 
and if so, its monitoring status (commit, fail or undecided). By default, future 
monitoring is deactivated. Contracts can modify their failmap (1) to activate 
the future monitor of the current transaction, or (2) to commit or fail undecided 
future monitors of previous transactions within the monitoring window. If a con- 
tract sets a past transaction failmap entry to fail, the corresponding transaction 
fails. The timeout function is invoked at the end of the monitoring window to 
decide whether to fail or commit if the future monitor of the transaction is still 
undecided. This guarantees that transactions cannot be pending after a bounded 
amount of time. 

We call our monitors future monitors since the decision to commit or fail may 
depend on transactions that will execute in the future. Future monitors expand 
the monitor hierarchy presented in [10], which included operation and transac- 
tion monitors as well as monitors that involve several contracts (multicontract 
monitors) or even the whole blockchain (global monitors), but always in the 
context of a single transaction. When combined with future monitors, we obtain 
multicontract future monitors and global future monitors, but we leave these extensions
as future work. A particular subclass of multicontract future monitors
was studied in [17], focusing on long-lived transactions, whose lifetimes span several
blockchain transactions and potentially involve different contracts and parties.
Fig. 1 shows the updated monitoring hierarchy including future monitors.


5 Optimistic Rollups for short. 
6 Currently a week.
2 Model of Computation


We now introduce our abstract model of computation to reason about blockchains.


Blockchain Execution Overview. Blockchains are incremental permanent
records of executed transactions packed in blocks. Transactions are in turn com- 
posed of a sequence of operations where the initial operation is an invocation 
from an external user. Each operation invokes a destination contract, which 
is identified by its unique address. The execution of an operation follows the 
instructions of the program (the contract) stored at the destination address. 
Contracts can modify their local storage and invoke other contracts. 

Transaction execution consists of executing operations, computing their ef- 
fects (which may include the generation of new operations) until either (1) there 
are no more pending operations, or (2) an operation fails or the available gas is 
exhausted. In the former case, the transaction commits and all changes are made 
permanent. In the latter case, the transaction fails and no effect takes place in 
the storage of contracts, except that some gas is consumed. Therefore, the state 
of contracts is determined by the effects of committing transactions. 


Model of Computation. Our model of computation describes blockchain state
evolution as the result of sequential transaction executions. Blockchain configu- 
rations are records containing all information required to compute transactions, 
such as: a partial map between addresses and their storage and balance, plus 
additional information about the blockchain such as block number. We use X to 
denote blockchain configurations and U to denote balances of external users. 
Transactions are the result of executing a sequence of operations starting from 
an external operation placed by a user. Transactions can either commit, if every 
operation is successful, or fail, if one of its operations fails or the gas is exhausted. 
We use function basicTx, which takes a transaction, a blockchain configuration, 
and balances of external users as inputs, and returns the blockchain configuration 
and the external user balances that result from executing the transaction in the 
input configuration. Additionally, predicate succ indicates whether the execution 
of a transaction commits or fails in a given blockchain configuration and external 
user balances. Furthermore, function discount deducts the specified amount of 
tokens from the balance of the indicated user in the provided external user 
balances.


                 Present                          Future

Global monitors                  Global future monitors [future work]
Multicontract monitors           Multicontract future monitors [future work]
Transaction monitors [10]        Future monitors [this work]
Operation monitors [6,15,18]


Fig. 1. Monitor hierarchy. The first column belongs to [10].


126 Capretto, Ceresa and Sanchez


The following relation $\leadsto_{tx}$ defines the evolution of the blockchain
using basicTx, succ and discount:

$$
\frac{\mathit{basicTx}(tx, X, U) = (X', U') \qquad \mathit{succ}(tx, X, U) = \mathit{commit}}
     {(X, U) \leadsto_{tx} (X', U')}\ \textsf{commit}
\qquad
\frac{U' = \mathit{discount}(U, \mathit{src}(tx), \mathit{cost}(tx)) \qquad \mathit{succ}(tx, X, U) = \mathit{fail}}
     {(X, U) \leadsto_{tx} (X, U')}\ \textsf{fail}
$$


If a transaction fails (rule fail), the blockchain configuration is preserved, 
but the external user originating the transaction pays for the resources con- 
sumed. Cost and resource analysis are out of the scope of this paper, so we 
ignore the computation of U. 

Operation and transaction monitors are defined at the operation and trans- 
action level, and thus, they are implemented inside basicTx and abstracted away 
in this model. 
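
The two rules can be read as a single step function. The following Python sketch is only illustrative: the dictionary-based configurations and the toy instances of basicTx, succ and cost are assumptions made here, not part of the model.

```python
from typing import Dict

Config = Dict[str, int]   # hypothetical configuration: contract address -> balance
Users = Dict[str, int]    # external user -> balance


def discount(users: Users, user: str, amount: int) -> Users:
    """Return a copy of `users` where `user` has paid `amount` tokens."""
    updated = dict(users)
    updated[user] -= amount
    return updated


def evolve(tx, config, users, basic_tx, succ, cost):
    """One step of the relation: rule commit or rule fail."""
    if succ(tx, config, users) == "commit":
        return basic_tx(tx, config, users)                  # (X', U')
    return config, discount(users, tx["src"], cost(tx))     # (X, U')


# Toy instances of basicTx, succ and cost (assumptions for illustration only).
def toy_basic_tx(tx, config, users):
    new_config = dict(config)
    new_config[tx["dst"]] = new_config.get(tx["dst"], 0) + tx["amount"]
    return new_config, discount(users, tx["src"], tx["amount"])


def toy_succ(tx, config, users):
    return "commit" if users[tx["src"]] >= tx["amount"] else "fail"


def toy_cost(tx):
    return 1
```

With these toy definitions, a transfer that the user can afford updates both the configuration and the user balances, whereas an unaffordable one leaves the configuration untouched and only charges the cost to the originating user.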


3 Bounded Future Monitored Blockchains 


In this section, we present a modified model of computation supporting future 
monitors. The main addition is the implementation of monitoring transactions 
predicating on future transactions within a monitoring window k. The monitoring
window captures how far into the future (in number of transactions) the monitor
can predicate. This additional feature enables us to install a monitor per transaction.
Future instances of contracts that activated a future monitor can decide
to either fail or commit the past transaction within the monitoring window. If
any contract sets the future monitor of a past transaction to fail, the
monitored transaction fails. Otherwise, when all contracts that monitor a given
transaction commit, the transaction becomes permanently committed.


3.1 Future k-bounded Monitors 


Transactions can commit or fail depending on their subsequent k transactions, 
and thus, the post-state after executing a transaction may depend on future 
transactions. At any given point in time, transaction future monitors may: 

— fail because at least one contract involved set the monitor to fail; 

— commit because all contracts involved set the monitor to commit; 

— stay pending. 
Therefore, we identify three transaction monitor states: known to fail (denoted
by Fail), known to commit (denoted by Commit) and undecided (denoted by ?).
Finally, we add another value to represent transactions without monitors: None.


Failing Map. A contract C can only interact with the future monitor of trans- 
action t if C was involved in t. To keep track of different monitors for C (for 
different transactions), every contract C has a map, called failing map, from 
transactions to monitor states. 

At the start of a transaction, the monitor is deactivated and can only be 
activated during the current transaction. Therefore, if at the end of a transaction 
t no contract updated the failing map of its monitor for t, then the behavior is
like legacy unmonitored transactions (as previously described in Section 2).

A contract C can modify its failing map many times but only the entries of
those transactions where C was involved and activated the monitor. Changes to
failing maps at the end of transactions can be (1) the activation of the monitor
for the current transaction (from None to Fail, Commit, or ?, indicated by dashed
arrows in Fig. 2); or (2) decisions reached for undecided monitors (from ? to
Fail or Commit, indicated by plain arrows).


Fig. 2. Monitor transitions. 
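
The transitions of Fig. 2 can be captured by a small validity check. This is a sketch under the assumption that decided entries (Commit or Fail) are final, which is how we read the restriction that contracts may only decide monitors that are still undecided; the names Monitor and valid_update are hypothetical.

```python
from enum import Enum


class Monitor(Enum):
    NONE = "None"        # no future monitor activated
    UNDECIDED = "?"
    COMMIT = "Commit"
    FAIL = "Fail"


def valid_update(old: Monitor, new: Monitor, is_current_tx: bool) -> bool:
    """Check a failing-map change against the transitions of Fig. 2."""
    if old is new:
        return True                       # entry left untouched
    if old is Monitor.NONE:
        return is_current_tx              # dashed arrows: activation
    if old is Monitor.UNDECIDED:
        return new in (Monitor.COMMIT, Monitor.FAIL)   # plain arrows: decision
    return False                          # Commit and Fail are final
```

In this reading, a contract that attempts any other change to its failing map makes the current transaction fail.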


Timeout. Contracts have a new special function called timeout that can be used
to describe the decision of undecided monitors at the end of the monitoring window.
Function timeout takes a transaction identifier and returns either Fail or Commit, and
it is set by contracts. The default timeout function returns Commit.

At the end of the monitoring window, the system invokes timeout if the failing
map entry for that transaction is marked as ?. If at least one contract involved in
the transaction decides to fail, the transaction fails, and otherwise the transaction
commits.
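
As a minimal sketch of this rule, the function below resolves a transaction at the end of its monitoring window. The dictionary encoding of failing-map entries and the name decide_at_window_end are assumptions made for illustration.

```python
from typing import Callable, Dict


def decide_at_window_end(failmap_entries: Dict[str, str],
                         timeouts: Dict[str, Callable[[str], str]],
                         txid: str) -> str:
    """failmap_entries maps each contract to its entry for txid:
    "None", "?", "Commit" or "Fail"."""
    for contract, state in failmap_entries.items():
        if state == "None":
            continue                                  # contract is not monitoring txid
        if state == "?":
            # Undecided: ask the contract's timeout; the default returns Commit.
            timeout = timeouts.get(contract, lambda _txid: "Commit")
            state = timeout(txid)
        if state == "Fail":
            return "Fail"                             # one failing contract suffices
    return "Commit"
```

For instance, with entries {"A": "?", "B": "Commit"} the transaction commits under the default timeout, and fails if contract A installed a timeout that returns Fail.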


3.2 Extending the Model of Computation 


We extend blockchain configurations with a future monitor context A associat- 
ing contracts with their failing map and timeout function. 


Transaction Execution: Transactions can immediately commit or fail, or depend 
on future transactions that happen within the monitoring window, so the exe- 
cution of a transaction can return one of the following cases: 

— a new configuration as an immediate commit, 

— a new configuration as an immediate fail, 

— two possible new configurations, one for failing and one for committing,
depending on the future.

These behaviors are captured by a new function applyTx that checks if future
monitors were activated during the transaction. Future monitors restrict the
behavior of the blockchain, because they only modify the blockchain evolution
by making transactions fail more often.

Non-monitored transactions either immediately commit or fail based on func- 
tion succ, and their effects are equivalent to the traditional model. 

The function applyTx, when applied to a monitored, non-failing transaction,
returns two blockchain configurations, describing the only two possible futures.
The first configuration represents the effects if the transaction commits, and the
second represents a failing transaction, in which case the post-configuration
is identical to the previous configuration (modulo resources consumed).

A contract C can only modify its failing map to activate the future monitor 
of the current transaction or to decide future monitors that C had previously 
activated but not yet decided. If a contract incorrectly updates its failing map,
the current transaction fails. When transactions fail, the system does not modify
any failing map or timeout function.


Blockchain System. There are two types of transactions: permanent (committed
or failed) and pending transactions. Blockchain runs are pairs $(H, \tau)$ consisting
of a sequence H of consolidated blockchain configurations, called the history,
and a directed tree $\tau$ where each internal node has one or two children. H
contains only permanent transactions. Tree $\tau$ is called the monitoring tree and
includes pending transactions. Each node in the monitoring tree is a blockchain
configuration. The monitoring tree represents all possible sequences of blockchain
states that the list of pending transactions can generate. Exactly one path in the
tree will eventually survive and become part of H, which depends on whether
the corresponding transactions commit or fail. Each level in the tree corresponds
to the execution of transactions up to that level, but different configurations at
the same level represent different possible realities. To simplify notation, we use n to
refer to the blockchain configuration captured by node n in the tree. The root of
the monitoring tree is the last blockchain configuration that was consolidated,
that is, the last blockchain configuration in the history sequence.

The height of the monitoring tree is at most k. It can be shorter than k
at the genesis of the blockchain, but once the first k transactions have been
executed the monitoring tree reaches and maintains height k. In the worst
case, depending on the contracts deployed in the blockchain, the monitoring
tree can have $2^{k+1} - 1$ nodes, but in general not every transaction is going to be
monitored, which reduces the branching and hence the size of the tree.
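
The worst-case bound follows from a geometric sum, assuming that every pending transaction is monitored and non-failing, so that every internal node has two children and level i of the tree contains $2^i$ nodes:

$$
\sum_{i=0}^{k} 2^{i} = 2^{k+1} - 1 .
$$

For example, with a monitoring window k = 2 the tree has at most 1 + 2 + 4 = 7 nodes.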

Fig. 3 shows a blockchain run $(H, \tau)$. The first j + 1 transactions are permanent
and the last k transactions are pending. The last permanent blockchain
configuration is (X, A) and it is also the root of the monitoring tree $\tau$. When the
first pending transaction, $t_{j+1}$, executes from configuration (X, A), a contract
C that executed in $t_{j+1}$ activated the transaction future monitor, generating a
branching in $\tau$. However, not all transactions generate a branching in the monitoring
tree, as not all transactions are necessarily monitored (Fig. 3 includes such a transaction).
Configuration (X, A') is one of the possible outcomes of executing all pending
operations.


Notation. We use the following functions: 
— nextTx(n): returns the transaction that labels the outgoing edges from n. 


— $n \xrightarrow{t} n'$: node n' is a successor of node n in the monitoring tree.

— successors(n): given a node n that is not a leaf, returns all successors of n,
which can be $(n_c, n_f)$, where $n_c$ is the committing successor and $n_f$ the
failing successor, or n' if n is not branching.

— the committing subtree of n: the maximal subtree rooted at the committing 
successor of n. 

— the failing subtree of n: the maximal subtree rooted at the failing successor 
of n. 

— allFutures(n): the set of leaves reachable from node n. 
Fig. 3. A blockchain run of j + 1 permanent transactions and k pending transactions.


Consider $n \xrightarrow{t} n'$. The configuration at n' is one of the possible results of executing
transaction t from the blockchain configuration at n. For simplicity, when
referring to a monitoring tree $\tau$ with the root node n, we use the terms $\tau$ and
n interchangeably. Thus, successors($\tau$) denotes the successors of the root node
of $\tau$. The set of possible futures of the root node of monitoring tree $\tau$, denoted by
allFutures($\tau$), is referred to as the futures in $\tau$.


Example 1. The following figure shows an example run after 7 transactions,
starting at initial blockchain configuration $N_0$ and monitoring window k = 2.
History H corresponds to the first 5 permanent transactions. The remaining
transactions are pending, forming a directed tree $\tau$ whose root is $N_5$. The transaction
at node $N_5$ is nextTx($N_5$) = $t_5$. The successors of node $N_5$ are successors($N_5$) =
$(N_6^c, N_6^f)$. The committing subtree of $N_5$ is the subtree with root $N_6^c$ and the
failing subtree of $N_5$ is the subtree with root $N_6^f$. Finally, the futures in $\tau$ are
allFutures($\tau$) = $\{N_7^{cc}, N_7^{cf}, N_7^{fc}, N_7^{ff}\}$. We annotate with superscript c and f the
committing and failing transactions, respectively, and group them in sequences
describing paths in monitoring trees.
function step((H, τ), t)
    τ' ← attach(τ, t)
    if height(τ') ≤ k then
        return (H, τ')
    else
        τ'' ← decide(τ')
        H.add(τ -> τ'')          ▷ The first pending transaction becomes permanent.
        return (H, τ'')

function attach(τ, t)            ▷ Extends monitoring trees.
    τ' ← τ
    for l ∈ allFutures(τ) do
        switch applyTx(t, l) do
            case Commit(l_c):        τ'.add(l -t-> l_c)
            case Fail(l_f):          τ'.add(l -t-> l_f)
            case Pending(l_c, l_f):  τ'.add(l -t-> (l_c, l_f))
    return τ'


Fig. 4. Functions step and attach. 


3.3 Blockchain evolution 


The evolution of the blockchain is defined by function step (see Fig. 4), which
takes blockchain runs and transactions, and extends runs. The system has only
one rule:

$$
\frac{\mathit{step}((H, \tau), t) = (H', \tau')}{(H, \tau) \rightarrow_{t} (H', \tau')}
$$

Valid traces are defined by the relation $\rightarrow$ and consist of chains of related
blockchain states $(H_0, \tau_0) \rightarrow_{t_0} (H_1, \tau_1) \rightarrow_{t_1} \ldots$ where $(H_0, \tau_0)$ is an initial
blockchain run with $\tau_0 = H_0 = (X, A)$.

Let $(H, \tau)$ be a blockchain run and t a transaction. We extend the monitoring
tree $\tau$ by adding a new level, attaching t to every possible leaf, which increases
the height of $\tau$ by one (see Fig. 4). Let $\tau'$ be the result of attach($\tau$, t). If $\tau'$ has
height k + 1, the monitoring window for the first transaction in $\tau'$ has expired
and its monitor must fail or commit. To take this decision, function step invokes
function decide. The monitoring tree $\tau''$ returned by function decide
becomes the new monitoring tree. Finally, function step extends H, making the
first pending transaction permanent.
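
The interplay between attach and step can be sketched in Python as follows. This is an illustrative reading of Fig. 4 under several assumptions: trees are mutated in place rather than copied, applyTx and decide are passed in as parameters, and the toy applyTx and the always-commit decide used in the run at the end are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class Node:
    config: Any                                            # blockchain configuration
    children: List["Node"] = field(default_factory=list)   # [], [n'] or [n_c, n_f]


def leaves(n: Node) -> List[Node]:
    """allFutures: the leaves reachable from n."""
    return [n] if not n.children else [l for c in n.children for l in leaves(c)]


def height(n: Node) -> int:
    return 0 if not n.children else 1 + max(height(c) for c in n.children)


def attach(tree: Node, tx: Any, apply_tx: Callable) -> Node:
    """Add one level by applying tx at every leaf (updates the tree in place)."""
    for leaf in leaves(tree):
        # apply_tx returns ("Commit", l_c), ("Fail", l_f) or ("Pending", l_c, l_f)
        _kind, *configs = apply_tx(tx, leaf.config)
        leaf.children = [Node(cfg) for cfg in configs]
    return tree


def step(history: list, tree: Node, tx: Any, k: int,
         apply_tx: Callable, decide: Callable[[Node], Node]) -> Tuple[list, Node]:
    tree = attach(tree, tx, apply_tx)
    if height(tree) <= k:
        return history, tree
    new_root = decide(tree)                 # committing or failing subtree
    return history + [new_root.config], new_root


# Toy run: configurations are strings recording the path taken so far.
def toy_apply(tx, config):
    if tx.startswith("m"):                  # "m..." stands for a monitored transaction
        return ("Pending", config + "c", config + "f")
    return ("Commit", config + "c")


history, tree = [""], Node("")
for tx in ["m1", "p2", "m3"]:
    history, tree = step(history, tree, tx, 2, toy_apply, lambda t: t.children[0])
```

With k = 2, the first two transactions only grow the tree; the third one pushes the height to 3, so the root transaction is decided and the history grows by one configuration while the tree returns to height 2.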

Function decide (see Fig. 5) determines whether to commit or fail the first
pending transaction tx in a monitoring tree $\tau$ with height k + 1, returning either
the committing or the failing subtree of $\tau$. If $\tau$ has only one successor, the decision
is trivial; otherwise, we analyze the possible futures of tx. Function decide checks all
futures assuming tx commits (i.e., all leaves in the committing subtree of $\tau$); if
the future monitor of transaction tx commits in all of them, then tx commits
and the committing subtree of $\tau$ becomes the new monitoring tree. Otherwise,
tx fails and the failing subtree of $\tau$ becomes the new monitoring tree. If decide
cannot assert whether the monitored transaction fails or commits, decide invokes
timeout to decide (see function knownToCommitWithTimeout in Fig. 5).

In some cases, the decision of future monitors is known before the monitoring
window ends. In such instances, some nodes are unreachable; we call them impossible
nodes. For example, when a transaction future monitor is waiting for a transaction
in the future and that transaction happens before the monitoring window
ends, the future monitor is going to be set to commit, which turns all nodes
in its failing subtree into impossible nodes. Concretely, if in all possible futures in
the committing subtree of node n its transaction is known to commit, then all
nodes in the failing subtree of n are impossible nodes. Similarly, if in all possible
futures in the committing subtree of node n its transaction is known to
fail, then all nodes in the committing subtree of n are impossible. Impossible
nodes are removed before deciding whether a transaction commits or not, since
we may otherwise incorrectly deduce that a monitor fails because of an impossible
future node. Consequently, decide invokes prune to remove all impossible nodes, and
only then does decide determine whether the root transaction commits or not, as
explained above.


function decide(τ)               ▷ Decides commit/fail of the root transaction of τ
    assert height(τ) = k + 1
    τ' ← prune(τ)
    t ← nextTx(τ)
    switch successors(τ') do
        case τ'': return τ''
        case (τ_c, τ_f):
            if ∀l ∈ allFutures(τ_c) : knownToCommitWithTimeout(l, t) then return τ_c
            else return τ_f

function prune(τ)
    if τ is a leaf then return τ
    t ← nextTx(τ)
    switch successors(τ) do
        case τ': return τ -t-> prune(τ')
        case (τ_c, τ_f):
            τ'_c ← prune(τ_c)
            τ'_f ← prune(τ_f)
            if ∀l ∈ allFutures(τ'_c) : knownToCommit(l, t) then return τ -t-> τ'_c
            if ∀l ∈ allFutures(τ'_c) : knownToFail(l, t) then return τ -t-> τ'_f
            return τ -t-> (τ'_c, τ'_f)

function failmapCommit(A, c, t) return A[c].failmap[t] = Commit
function failmapFail(A, c, t) return A[c].failmap[t] = Fail
function timeoutCommit(A, c, t) return A[c].timeout[t] = Commit
function undecided(A, c, t) return A[c].failmap[t] = ?
function monitoringContracts(l, t) return {c : l.A[c].failmap[t] ≠ None}
function knownToCommit(l, t)
    return ∀c ∈ monitoringContracts(l, t) : failmapCommit(l.A, c, t)
function knownToFail(l, t)
    return ∃c ∈ monitoringContracts(l, t) : failmapFail(l.A, c, t)
function commitWithTimeout(A, c, t)
    return failmapCommit(A, c, t) ∨ (undecided(A, c, t) ∧ timeoutCommit(A, c, t))
function knownToCommitWithTimeout(l, t)
    return ∀c ∈ monitoringContracts(l, t) : commitWithTimeout(l.A, c, t)


Fig. 5. Functions decide, prune and auxiliary functions.


Fig. 6. Application of function step in a blockchain run.

Function prune (see Fig. 5) shows how to prune impossible nodes from trees.
To guarantee that impossible nodes are pruned before checking if roots of trees 
are impossible (either commit or fail), we perform a bottom-up recursion. 


Example 2. Fig. 6 shows the result of applying function step to blockchain run
$(H, \tau)$ with a monitoring window k = 2 and two pending transactions $t_i$ and
$t_{i+1}$. Each node in the monitoring tree is annotated with the monitor state of
all pending transactions up to that node: a question mark means undecided
monitors, a tick means known to commit monitors, a cross means known to fail
monitors, and a dash denotes no monitored transactions. Initially, no monitors
are decided in any node in $\tau$.

Function step($(H, \tau)$, $t_{i+2}$) first invokes function attach($\tau$, $t_{i+2}$). This function
adds a new level to $\tau$ by applying transaction $t_{i+2}$ at all leaves in $\tau$, obtaining
monitoring tree $\tau'$, Fig. 6(a). Transaction $t_{i+2}$ immediately commits at all leaves
in $\tau$, generating nodes $N^{ccc}$, $N^{cfc}$, $N^{fcc}$ and $N^{ffc}$. The future monitor for transaction
$t_i$ is known to fail at node $N^{cfc}$ while remaining undecided at node $N^{ccc}$, and
the future monitor for transaction $t_{i+1}$ is known to commit at nodes $N^{ccc}$ and
$N^{fcc}$. Next, as the height of the new monitoring tree, $\tau'$, is 3 > 2, function step
invokes function decide($\tau'$) to decide if the first pending transaction, $t_i$, fails or
commits. Function decide invokes function prune to remove all impossible nodes
in $\tau'$. When computing prune, the failing subtree of node $N^c$, rooted at node
$N^{cf}$, is removed because at node $N^{ccc}$ the future monitor for the transaction
at node $N^c$, $t_{i+1}$, is known to commit and node $N^{ccc}$ is the only future in the
committing subtree of node $N^c$, making the subtree rooted at $N^{cf}$ an impossible
subtree. Similarly, the subtree rooted at $N^{ff}$ is an impossible subtree and it is
also removed by function prune.

Subtrees with roots $N^{cf}$ and $N^{ff}$ are the only ones removed when applying
function prune to monitoring tree $\tau'$, as shown in Fig. 6(b).

Finally, to decide whether to commit transaction $t_i$ or not, function decide
considers node $N^{ccc}$, as it is the only future in the committing subtree of node
N in the monitoring tree returned by function prune. At node $N^{ccc}$ the future
monitor for transaction $t_i$ is undecided. However, since its monitoring window
has ended, function decide uses the timeout of the contracts that are undecided.
Assuming that the timeout functions of all undecided contracts commit transaction
$t_i$, function decide commits transaction $t_i$, returning the subtree rooted at
$N^c$ as the new monitoring tree (see Fig. 6(c)); the transaction would fail if the
timeout function of at least one contract failed. Finally, function step extends H by
making transaction $t_i$ permanent. If prune had not been applied before function decide
evaluated all futures in the committing subtree of N, transaction $t_i$ would have incorrectly
failed, as in impossible future $N^{cfc}$, the future monitor for transaction $t_i$ fails.
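
The pruning and decision steps of this example can be replayed with a small Python sketch of Fig. 5. It is illustrative only: nodes are named by their commit/fail path from the root N, a single hypothetical contract C does all the monitoring, and monitor entries that the example does not mention are assumed to be absent (None).

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    tx: Optional[str] = None                               # nextTx(n); None at leaves
    children: List["Node"] = field(default_factory=list)   # [n'] or [n_c, n_f]
    # failmap[tx][contract] is "?", "Commit" or "Fail"; absent means None.
    failmap: Dict[str, Dict[str, str]] = field(default_factory=dict)
    # timeout[tx][contract]; contracts not listed use the default, Commit.
    timeout: Dict[str, Dict[str, str]] = field(default_factory=dict)


def leaves(n):
    return [n] if not n.children else [l for c in n.children for l in leaves(c)]

def known_to_commit(l, tx):
    return all(s == "Commit" for s in l.failmap.get(tx, {}).values())

def known_to_fail(l, tx):
    return any(s == "Fail" for s in l.failmap.get(tx, {}).values())

def known_to_commit_with_timeout(l, tx):
    timeouts = l.timeout.get(tx, {})
    return all(s == "Commit" or (s == "?" and timeouts.get(c, "Commit") == "Commit")
               for c, s in l.failmap.get(tx, {}).items())

def prune(n):
    """Bottom-up removal of impossible subtrees; returns a new tree."""
    if not n.children:
        return n
    if len(n.children) == 1:
        return replace(n, children=[prune(n.children[0])])
    c, f = prune(n.children[0]), prune(n.children[1])
    if all(known_to_commit(l, n.tx) for l in leaves(c)):
        return replace(n, children=[c])        # failing subtree is impossible
    if all(known_to_fail(l, n.tx) for l in leaves(c)):
        return replace(n, children=[f])        # committing subtree is impossible
    return replace(n, children=[c, f])

def decide(tree, pruning=True):
    """Return the subtree that becomes the new monitoring tree."""
    pruned = prune(tree) if pruning else tree  # pruning=False only for comparison
    if len(pruned.children) == 1:
        return pruned.children[0]
    c, f = pruned.children
    ok = all(known_to_commit_with_timeout(l, tree.tx) for l in leaves(c))
    return c if ok else f


def chain(name, leaf_name, failmap):
    """Node whose only successor (via t_i+2) is a leaf carrying `failmap`."""
    return Node(name, "t_i+2", [Node(leaf_name, failmap=failmap)])

# Monitoring tree after attaching t_i+2, following Example 2.
tree = Node("N", "t_i", [
    Node("N^c", "t_i+1", [
        chain("N^cc", "N^ccc", {"t_i": {"C": "?"}, "t_i+1": {"C": "Commit"}}),
        chain("N^cf", "N^cfc", {"t_i": {"C": "Fail"}}),
    ]),
    Node("N^f", "t_i+1", [
        chain("N^fc", "N^fcc", {"t_i+1": {"C": "Commit"}}),
        chain("N^ff", "N^ffc", {}),
    ]),
])
```

Calling decide(tree) returns the subtree rooted at N^c, whereas decide(tree, pruning=False) returns the one rooted at N^f, which is the incorrect failure that pruning avoids.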


An example of contracts that only lend their tokens if they receive them back 
within 2 transactions in the future can be found in [11].


4 Properties 


We now discuss properties of the model of computation defined in Section 3. In
particular, we establish how the new model extends the previous one, that the
size of monitoring trees is manageable, and that the blockchain always progresses.
We assume a fixed monitoring window k. All proofs can be found in [11].

After the monitoring window has expired, the root transaction is confirmed 
and one of two possible successors is consolidated. 


Lemma 1. Let $(H, \tau)$ be the system run after k transactions, t a transaction
and $(H', \tau')$ = step($(H, \tau)$, t). The root of $\tau'$ is one of the successors of the root of
$\tau$ and all paths in $\tau'$ without leaves are also paths in $\tau$. Moreover, $H'$ is obtained
by extending H with the first pending transaction in $\tau$.


The first k transactions from the genesis are just added to the tree. From 
the previous lemma, after k transactions and when a new step is taken, the first 
pending transaction is either committed or failed and a new pending transaction 
is attached to all leaves. Moreover, the transaction added to the history is the 
root of the previous monitoring tree and one of its successors is the root of the 
new monitoring tree. In other words, exactly one of the paths in the monitoring 
tree eventually becomes permanent, and thus, the blockchain always progresses. 

Corollary 1 (Progress). Function step is total and, after the first k 
invocations, each execution of step makes one transaction permanent. 


The height of the monitoring tree is bounded by the monitoring window. 


Lemma 2 (Bounded Certainty). Let T be a monitoring tree in a blockchain 
run obtained by applying function step l times. Then, the height of T is the 
minimum of l and k. Moreover, all leaves in T are in its last level. 
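
The progress and height bounds can be illustrated with a minimal Python sketch. It assumes a run without monitored transactions, so the monitoring tree degenerates into a chain and functions prune and decide play no role. The names step and run below are simplified stand-ins chosen for illustration, not the functions defined in the paper.

```python
from collections import deque

def step(history, pending, tx, k):
    """Append tx to the pending chain; once more than k transactions are
    pending, the oldest pending transaction becomes permanent."""
    pending.append(tx)
    if len(pending) > k:
        history.append(pending.popleft())

def run(txs, k):
    """Apply step to each transaction and record the chain height after each step."""
    history, pending, heights = [], deque(), []
    for tx in txs:
        step(history, pending, tx, k)
        heights.append(len(pending))
    return history, list(pending), heights

# Hypothetical run with five transactions and a monitoring window of k = 2.
history, pending, heights = run(["t1", "t2", "t3", "t4", "t5"], k=2)
```

In this simplified setting the height after l steps is the minimum of l and k, and every step after the first k moves exactly one transaction into the history.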


Function prune removes all impossible nodes from monitoring trees. Func- 
tion prune recursively removes impossible nodes in the committing and failing 
subtrees, and then, determines if it can remove any subtree by inspecting all 
possible futures in the committing successor. 


Lemma 3. Function prune(T) returns a sub-monitoring tree of T without im- 
possible nodes and only impossible nodes were removed. 


Function step consistently makes the blockchain progress. After more than k 
transactions have been added, the first pending transaction is made permanent 
(see Corollary 1). The resulting monitoring tree keeps the order of the rest of 
the pending transactions and also preserves the information of all pending 
transactions except the last. 


Lemma 4. Let T be a monitoring tree, n be the result of expanding T with a new 
transaction, t be the first pending transaction in T, and v be the decided subtree 
of n. 
- If n has only one successor, then v is the result of pruning n's successor. 
- If n has two successors, then let n_c and n_f be the result of pruning the 
  committing and failing subtrees of n, respectively. 
  * Monitoring tree v is n_c if, in all possible futures assuming t commits, 
    transaction t does not fail or, if no decision has been reached, all pending 
    timeout functions of t commit. 
  * Monitoring tree v is n_f if there is a possible future where, assuming 
    transaction t commits, the monitor of t fails or some of the pending 
    timeout functions of t fail. 


The size of monitoring trees can be exponential in the number of monitored 
transactions rather than in the monitoring window size, as monitored transactions 
are the only ones branching monitoring trees. 


Lemma 5. Let T be a monitoring tree and m be the number of monitored 
transactions in T (so m ≤ k). Then, the size of T is in O(2^m × k). 
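
To make the bound concrete, the following Python sketch counts the nodes of an unpruned monitoring tree level by level. It assumes one level per pending transaction, two successors below each node of a monitored transaction, and one successor below each node of an unmonitored one. The helpers tree_size and bound, and the boolean-list encoding, are hypothetical and only serve as an illustration.

```python
def tree_size(monitored):
    """Number of nodes in an unpruned tree; monitored[i] says whether the
    i-th pending transaction has a future monitor (and therefore branches)."""
    size, width = 0, 1
    for is_monitored in monitored:
        size += width          # all nodes on this level
        if is_monitored:
            width *= 2         # a commit and a fail successor per node
    return size

def bound(monitored):
    """The value 2^m * k for a window with k transactions, m of them monitored."""
    k, m = len(monitored), sum(monitored)
    return 2 ** m * k

examples = [
    [False, False, False, False],   # legacy run: the tree is a chain
    [True, False, False, False],    # one monitored transaction
    [True, True, True, True],       # every transaction is monitored
]
sizes = [tree_size(e) for e in examples]
```

With a constant number of monitored transactions the factor 2^m is constant, so the size grows linearly in k, as the chain and single-monitor examples suggest.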


In practical scenarios, the number of monitored transactions typically is small 
compared to the monitoring window because most transactions do not require 
future monitors. This makes the size of the monitoring tree much smaller than 
the theoretical maximum. 


Corollary 2. If the number of monitored transactions in monitoring trees is 
constant then the size of monitoring trees is bounded by O(k). 


Finally, we show that adding future bounded monitors preserves legacy 
executions, so for blockchain runs where no contracts use future monitors, the 
monitoring tree is a chain with no branching. 

A legacy monitoring tree T is such that every configuration obtained from 
applying applyTx coincides with rule ~. 


Lemma 6 (Legacy Pending Transactions). Let T be a legacy monitoring 
tree. Then, T is a chain and the effect of executing all transactions in T is 
equivalent to executing them in the traditional model of computation. 


If we add that the permanent history is equivalent (up to now) to the 
traditional model, then the evolution of the blockchain in both models coincides. 


Lemma 7 (Legacy History). Let T be a legacy monitoring tree and H be 
a history such that every permanent transaction coincides with rule ~~. Then, 
the result of concatenating H and T is equivalent to the traditional model of 
computation. 


From Corollary 1 and Lemma 7 we conclude that the new model of 
computation is consistent with the previous model of computation and eventually 
creates a chain. Additionally, Corollary 2 implies that in practical scenarios, the 
size of monitoring trees is linear in the monitoring window, making it a feasible 
and practical blockchain implementation. 


5 Atomic Loans 


Flash loan contracts allow other contracts to borrow tokens without any 
collateral only if the borrowed tokens are repaid during the same transaction [12] 
(typically with some interest). Atomic loans are a generalization of flash loans 
where the borrowing party can repay the lending party in future transactions. It 
is not possible to implement flash loans unless additional mechanisms are added 
to the blockchain [10]. Similarly, it is impossible to implement atomic loans in 
traditional blockchain computational models. Just as transaction monitors [10] 
enable flash loans, future monitors allow monitors to check properties 
across transactions, enabling atomic loans. We now illustrate how to implement 
atomic loans using the monitoring window as the maximum payback time. 

We specify lender contracts as contracts respecting the following two prop- 
erties: 


Specification 1 (Atomic Loans) We say contract A is an atomic lender if: 
AL-safety: A loan from A is repaid to A within the monitoring window. 
AL-progress: Contract A grants loans unless AL-safety is violated. 


The following contract FlashLoanLender shows a simple contract implementing 
a flash loan lender¹ using the Fail/NoFail hookup [10], i.e., with transaction 
monitors but no future monitors. We highlight monitor code with gray background. 


¹ Flash loan lenders are atomic loan lenders with a payback window of one. 

contract FlashLoanLender { 
  uint pending_returns = 0;  // tokens (plus fees) still owed in this transaction 
  uint fee; 
  function lend(address payable dest, uint amount) public 
  { require(amount <= this.balance); 
    dest.receiveLoan{value: amount}(fee); 
    pending_returns += amount + fee; 
    this.fail = (pending_returns != 0); }  // fail unless everything is repaid 
  function returnLoan() external payable 
  { pending_returns -= msg.value; 
    this.fail = (pending_returns != 0); } } 


Function lend lends as long as the lender has enough funds, annotates the 
borrowed tokens in pending_returns and sets its fail bit so the transaction 
commits only if the loan is paid back. When the loan is returned, returnLoan de- 
creases pending_returns and updates its fail bit. At the end of each transaction, 
if there are pending loans the fail bit will make the transaction fail. 
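
The effect of the fail bit can be simulated outside a blockchain. The following Python sketch is a simplified model, not the paper's semantics: a transaction runs on a copy of the state and its effects are kept only if the lender's fail bit is clear at the end. The classes, the run_transaction helper, and the concrete amounts are assumptions made for illustration.

```python
import copy

class FlashLoanLender:
    def __init__(self, balance, fee):
        self.balance, self.fee = balance, fee
        self.pending_returns = 0
        self.fail = False            # fail bit, checked at the end of a transaction

    def lend(self, borrower, amount):
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount
        borrower.balance += amount
        self.pending_returns += amount + self.fee
        self.fail = self.pending_returns != 0

    def return_loan(self, borrower, value):
        borrower.balance -= value
        self.balance += value
        self.pending_returns -= value
        self.fail = self.pending_returns != 0

class Borrower:
    def __init__(self, balance):
        self.balance = balance

def run_transaction(state, body):
    """Run body on a copy of state; keep the result only if the fail bit is clear."""
    trial = copy.deepcopy(state)
    body(trial)
    return (trial, True) if not trial["lender"].fail else (state, False)

def borrow_and_repay(s):
    s["lender"].lend(s["borrower"], 100)
    s["lender"].return_loan(s["borrower"], 101)   # amount plus fee

def borrow_and_keep(s):
    s["lender"].lend(s["borrower"], 100)          # never repaid

start = {"lender": FlashLoanLender(1000, fee=1), "borrower": Borrower(10)}
after_repay, committed_repay = run_transaction(start, borrow_and_repay)
after_keep, committed_keep = run_transaction(start, borrow_and_keep)
```

The repaying transaction commits and the lender earns the fee, whereas the transaction that keeps the tokens is reverted and leaves both balances unchanged.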

The above contract implements flash loans that must be returned within a 
transaction, but does not work properly if future transactions are considered. It 
is not possible to successfully predict or check whether the loan is returned in 
some future transactions. We show now how future monitors solve this problem. 

The following contract Lender is an atomic lender using future monitors. All 
loans are treated equally and should be paid back on time; if one loan is not 
returned, then all loans issued in the same transaction are rejected. This is 
stricter than practical cases require, but it is enough to illustrate 
the use of future transaction monitors. 


contract Lender { 
  uint fee; 
  function lend(address payable dest, uint amount) public 
  { require(amount <= this.balance); 
    dest.receiveLoan{value: amount}(fee); 
    pending_returns[msg.txid] += amount + fee; 
    if (pending_returns[msg.txid] != 0) 
      this.failmap[msg.txid] = UNDECIDED; } 
  function returnLoan(txId id) public 
  { pending_returns[id] -= msg.value; 
    if (pending_returns[id] == 0) this.failmap[id] = COMMIT; } 
} with monitor { 
  map<txId, int> pending_returns; 
  function timeout(txId id) { return FAIL; } } 


Contract Lender uses a map pending_returns, from transactions to the amount 
borrowed within that transaction, to determine whether a transaction should 
commit or fail. Function lend grants a loan if the lender has enough funds, 
increases the corresponding entry in map pending_returns for the current trans- 
action and sets the failmap entry activating the current transaction monitor. 
Client contracts can repay loans by invoking returnLoan, which receives the 
transaction identifier of the lending transaction to decrease the corresponding 
entry in pending_returns by the amount received. If pending_returns reaches 0 
for a given transaction, the failmap entry of that transaction is set to COMMIT. 
Finally, timeout returns FAIL to fail transactions with unpaid loans at the end 
of their monitoring window. 


Fig. 7. Balance of contracts NC and L in the monitoring tree after executing the three 
transactions posted by a client. 


Clients can request loans without further collateral, satisfying AL-progress, 
and if loans are not returned within the monitoring window, the lending trans- 
action will retroactively fail, satisfying AL-safety. 
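
The interplay of the failmap and the timeout function can be sketched in Python. This is a minimal model under stated assumptions: it tracks only the monitor state of a lender (no balances and no branching), and the helper decide is a hypothetical stand-in for what happens to a single transaction once its monitoring window ends.

```python
UNDECIDED, COMMIT, FAIL = "UNDECIDED", "COMMIT", "FAIL"

class Lender:
    def __init__(self, fee):
        self.fee = fee
        self.pending_returns = {}   # txid -> amount still owed for that transaction
        self.failmap = {}           # txid -> UNDECIDED or COMMIT

    def lend(self, txid, amount):
        owed = self.pending_returns.get(txid, 0) + amount + self.fee
        self.pending_returns[txid] = owed
        if owed != 0:
            self.failmap[txid] = UNDECIDED   # activate the future monitor

    def return_loan(self, txid, value):
        owed = self.pending_returns.get(txid, 0) - value
        self.pending_returns[txid] = owed
        if owed == 0:
            self.failmap[txid] = COMMIT      # fully repaid

    def timeout(self, txid):
        return FAIL                          # unpaid loans fail the lending transaction

def decide(lender, txid):
    """Verdict for txid at the end of its monitoring window (simplified)."""
    verdict = lender.failmap.get(txid, COMMIT)   # unmonitored transactions commit
    return lender.timeout(txid) if verdict == UNDECIDED else verdict

lender = Lender(fee=1)
lender.lend("t1", 100)
lender.return_loan("t1", 101)    # repaid in full within the window
lender.lend("t2", 50)
lender.return_loan("t2", 20)     # only partially repaid
verdicts = {tx: decide(lender, tx) for tx in ("t1", "t2", "t3")}
```

Transaction t1 commits because its debt reached 0, t2 fails through the timeout, and t3, which never borrowed, is unaffected.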


The following contract NaiveClient requests a loan by invoking borrow. 


contract NaiveClient { 
  map<pair<txId,Lender>, uint> toPay; 
  function borrow(Lender l, uint amount) onlyOwner 
  { l.lend(amount); 
    toPay[(msg.txid,l)] = amount; } 
  function receiveLoan(uint fee) 
  { toPay[(msg.txid,msg.sender)] += fee; } 
  function invest() onlyOwner { ... } 
  function payBack(Lender l, uint amount, txId id) onlyOwner 
  { require(toPay[(id,l)] >= amount); 
    toPay[(id,l)] -= amount; 
    l.returnLoan{value: amount}(id); } } 


In subsequent transactions, the client can invest the funds and, in a final 
transaction, return the loan to the lender by invoking payBack. 


Let NC and L be two contracts installed in a blockchain with a monitoring 
window of length 2, where NC runs NaiveClient and L runs Lender. Consider 
(X, A) to be the current state of the blockchain at which NC has 100 tokens and 
L has 1000 tokens. From (X, A), the sequence of transactions is: (1) NC requests 
a loan, (2) NC invests assuming contract L lends the money, and (3) NC returns 
the loan. Because L employs future monitors to guarantee clients pay back, the 
first transaction generates a branching on the blockchain evolution. The next two 
transactions are not monitored, thus they do not create any branching. 
Therefore, after these three transactions, there exist two possible futures as 
shown in Fig. 7: one where L grants the loan and another where it does not. We 
can see that NC pays back in all possible futures. Moreover, contract NC pays 
back even in the future where contract L fails the past lending operation (for a 
detailed explanation see [11]). 

A malicious lender can take advantage of such behavior, for example using 
the following contract MaliciousLender. 


contract MaliciousLender { 
  uint fee; 
  function lend(address payable dest, uint amount) public 
  { dest.receiveLoan{value: amount}(fee); 
    this.failmap[msg.txid] = UNDECIDED; } 
  function returnLoan(txId id) public { return; } 
} with monitor { 
  function timeout(txId id) { return FAIL; } } 


Upon receiving a loan request in function lend, the above malicious lender 
grants the loan, if it has enough tokens, and marks the transaction as undecided 
using its failmap map. However, this lender contract does not update its failmap 
map when receiving paybacks. Therefore, at the end of the monitoring window, 
the monitor remains undecided, making the lending transaction fail due to the 
timeout function. In other words, the malicious lender never lends any tokens, 
as all its loans are reverted, but it looks like it does. When combined with 
NaiveClient and the same three transactions described earlier, the malicious 
lender will receive the repayment of a loan from client NC without having given 
the loan. In Fig. 7, the bottom branch is the one that survives when the lender 
implements a malicious contract. 

The problem arises because client NC does not implement any mechanism 
to check in which branch it is executing when repaying the loan. The naive 
contract does not distinguish between the scenario where the loan will ultimately 
be committed and the scenario where it will fail. As a result, client NC ends up 
providing payments in both cases. 

The following contract Client presents a correct client implementing two 
maps, requested and toPay, to keep track of the amounts requested from lenders 
and its debts owed to lenders, respectively. 


contract Client { 
  map<pair<txId,Lender>, uint> toPay, requested; 
  function borrow(Lender l, uint amount) onlyOwner 
  { l.lend(amount); 
    requested[(msg.txid,l)] = amount; } 
  function receiveLoan(uint fee) 
  { require(requested[(msg.txid,msg.sender)] == msg.value); 
    requested[(msg.txid,msg.sender)] = 0; 
    toPay[(msg.txid,msg.sender)] = msg.value + fee; } 
  function invest() onlyOwner { ... } 
  function payBack(Lender l, uint amount, txId id) 
  { require(toPay[(id,l)] >= amount); 
    toPay[(id,l)] -= amount; 
    l.returnLoan{value: amount}(id); } } 


Fig. 8. Balance of contracts C and L in the monitoring tree after executing the three 
transactions posted by a client. 


The above contract allows clients to determine the specific path in which 
they are executing, and thus, to decide whether to repay. Consequently, clients 
can successfully get loans from correct lenders while being resistant to attacks 
from malicious lenders. 

Fig. 8 shows an execution following the same transactions as before but with 
the correct contract Client: clients request a loan, invest the money, and pay 
back the loan. The top branch shows the case where the lender sends the money and 
the client returns it, while the bottom branch shows the case where the loan is 
not given. In the former case, the client returns the money, and in the latter 
case, the client just fails the transaction. 
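
The difference between a client that repays blindly and one that checks its recorded debt can be worked through numerically. The following Python sketch computes the balances at the end of a single branch of the monitoring tree. It starts from the balances used in the example (100 tokens for the client, 1000 for the lender); the loan amount 50, the fee 5, the omission of the invest transaction, and the modelling of the naive client as paying unconditionally are assumptions made for illustration.

```python
def final_balances(loan_granted, checks_debt, amount=50, fee=5):
    """Balances (client, lender) at the end of one branch of the monitoring tree."""
    client, lender, to_pay = 100, 1000, 0
    if loan_granted:                 # the lending transaction took effect here
        client, lender = client + amount, lender - amount
        to_pay = amount + fee        # debt recorded by the client in this branch
    owed = amount + fee              # the last transaction tries to repay this much
    if checks_debt and to_pay < owed:
        return client, lender        # the require fails, so nothing is paid
    return client - owed, lender + owed

naive_failed = final_balances(loan_granted=False, checks_debt=False)
careful_failed = final_balances(loan_granted=False, checks_debt=True)
naive_granted = final_balances(loan_granted=True, checks_debt=False)
careful_granted = final_balances(loan_granted=True, checks_debt=True)
```

In the branch where the lending transaction fails, the unconditional payer loses 55 tokens to the lender, while the checking client keeps its 100 tokens; in the branch where the loan is granted both behave identically.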

These examples show that even contracts not monitoring transactions need 
to be aware that transactions can create potential executions in the blockchain 
evolution that may be reverted due to future monitors. Since the same 
transaction is executed in all possible scenarios, but its effects may differ, 
contracts need to know in which timeline they are executing and act 
accordingly. Contract Client accomplishes this by maintaining a record of debts 
owed to lenders in variable toPay. 


6 Related Work 


Dynamic verification of smart contracts Runtime monitoring tools like 
ContractLarva and Solythesis take a smart contract's code and its 
properties as input and produce a safe smart contract that fails transactions 
violating the given properties. They achieve this by injecting the monitor into 
the smart contract as additional instructions. Therefore, these monitors are 
restricted to one operation in a single contract. Transaction Monitors [10] 
extend monitoring beyond a single operation to observe the effect of an entire 
transaction execution on a given contract. 

While these existing works provide strong foundations for smart contract ver- 
ification, none directly address the ability to react based on future transactions, 
as proposed in this work. 


Branching Computational Models The monitoring tree generated by pending 
transactions might resemble the tree-like structure in branching-time logics 
such as CTL [13]. However, the branching in the monitoring tree represents all 
possible futures given by the monitors of the pending transactions, and exactly 
one path eventually consolidates. In particular, future monitors are not aware of 
the existence of other paths in the monitoring tree and therefore cannot reason 
about them. CTL, on the other hand, can be used to express properties that 
reason about different paths in the tree. 


7 Conclusion 


We presented future monitors for smart contracts. Future monitors are a defense 
mechanism enabling contracts to state properties across multiple transactions. 
These kinds of properties are motivated by long-lived transactions, in partic- 
ular by atomic loans, which are not implementable in their full generality in 
current blockchains. To implement future monitors, we introduced the notion of 
monitoring window and two additional new mechanisms to blockchains, namely 
failing maps and timeout functions. 

Future monitors delay the consolidation of transactions, but the system re- 
mains consistent and we gain in expressivity. The outcome of transactions re- 
mains deterministic and depends solely on the transactions themselves, but now 
transactions can fail because of future actions. Combining all elements we ob- 
tained a deterministic semantics with future monitors in place. 

We have also illustrated that contracts need to be aware of the existence of 
possible executions. Future monitors introduce a branching model to describe 
the evolution of blockchain systems where transactions may commit or not, 
caused by the temporary uncertainty regarding the effect of pending transac- 
tions. Consequently, when new transactions are added to the blockchain, they 
are executed in multiple blockchain configurations, representing possible time- 
lines. Therefore, contracts need to be aware of the different contexts in which 
they are executing, ensuring that the transaction produces the desired effects in 
all possible realities. 

The main contribution of this paper is theoretical and we leave the full 
implementation of future monitors as future work. Optimistic rollup systems, 
where the effect of transactions is already delayed due to the fraud-proof 
arbitration scheme, present an ideal environment to incorporate future monitors 
into practical blockchain systems without further implications. In particular, 
optimistic rollup systems can allow future transaction monitors with few 
modifications, and more importantly, without modifying the underlying blockchain. 

For simplicity, we have neglected a specific analysis of the additional gas 
consumption that arises from using future monitors, which might lead to the 
failure of otherwise accepted transactions. Nevertheless, we conjecture that 
future monitors are simple enough to guarantee that a calculable amount of gas 
will prevent out-of-gas failures. We leave a detailed study for future work. 


Abstract. Maintaining software is cumbersome when method argument 
constraints are undocumented. To reveal them, previous work learned 
preconditions from exemplary valid and invalid method arguments. In 
practice, it would be highly beneficial to know class invariants, too, be- 
cause functionality added during software maintenance must not break 
them. Even more so than method preconditions, class invariants are 
rarely documented and often cannot completely be inferred automati- 
cally, especially for objects exhibiting complex state such as dynamic 
data structures. 

This paper presents a novel dynamic approach to learning class invari- 
ants, thereby complementing related work on learning method precon- 
ditions. We automatically synthesize assertions from an adjustable as- 
sertion grammar to distinguish valid and invalid objects. While random 
walks generate valid objects, a combination of bounded-exhaustive testing 
techniques and behavioral oracles yields invalid objects. The utility 
of our approach for code comprehension and software maintenance is 
demonstrated by comparing our learned invariants to documented in- 
variant validation methods found in real-world Java classes and to the 
invariants detected by the Daikon tool. 


1 Introduction 


Comprehending the behavior of a complex software component is challenging, 
but necessary for component reuse and maintenance. The object-oriented pro- 
gramming paradigm has enforced the principle of information hiding, which sep- 
arates externally observable behavior from internal implementation. To make a 
component reusable, it typically suffices to document its external behavior and 
the constraints imposed on its method argument values. When following the prin- 
ciples of defensive programming [4], a thorough input validation at the entry of 
each method checks whether the constraints are satisfied. For components that 
lack input validation, previous work has shown that appropriate preconditions 
can be inferred automatically [2,8,27,30,33]. 

To make a component maintainable, however, information on its external 
behavior alone is insufficient, because maintenance may require modifications 
of the component’s implementation. Class invariants [19,20] capturing the 
constraints on the component’s program state exhibited at runtime are essential 
for maintainers to ensure that their source code modifications, such as bug fix- 
ing, refactoring, or implementing new functionalities, match the assumptions 
implicitly encoded in the existing source code. A failure to do so may result in 
unpredictable behavior or even system crashes. Despite this, class invariants are 
rarely documented and checked even more rarely during input validation. 

Approaches to dynamic assertion learning generalize from observations, e.g., 
object states, to synthesize assertions such as preconditions and class invariants. 
Related tools include Daikon [8], Proviso [2], Hanoi [22], and EvoSpex [25]. 
Daikon observes program states during execution and uses templates to obtain 
a set of candidate assertions, including class invariants, that hold at certain 
program locations. Proviso learns preconditions that also consider complex data 
types and uses a test generator as an oracle to detect invalid method arguments. 
Hanoi infers representation invariants for data types in a functional programming 
language. EvoSpex employs an evolutionary algorithm to learn postconditions 
from (in)valid pre/post state pairs. Overall, the exploration of approaches to 
dynamic class invariant learning for complex types remains relatively limited, 
despite the potential benefits for software maintenance. 

This paper proposes a dynamic analysis approach that learns a class invariant 
using iterative refinements from (in)valid objects. We perform random walks in 
object state spaces to construct valid objects and combine bounded-exhaustive 
testing techniques [36[18] with behavioral oracles to create invalid objects. As or- 
acles, one can either adapt the random walks or provide property-based tests [9]. 
We refine our candidate invariant by removing existing or introducing new as- 
sertions, which are dynamically constructed along an assertion grammar. This 
process iterates until all obtained (in)valid objects are classified correctly. 

We have implemented our class invariant learning approach for Java in a pro- 
totype tool, called Geminus. Our evaluation shows, for real-world Java classes 
taken primarily from the java.util package, that our learned class 
invariants are at least as accurate as, and often surpass, those detected by Daikon 
or documented in the code. Beyond software maintenance, class invariants also 
support various software development activities, including software testing [13]. 


Organization Section 2 introduces the notions of class invariant and 
bounded-exhaustive/property-based testing alongside a running example. Section 3 
explains our class invariant learning approach and Section 4 evaluates it. Section 5 
discusses related work, while Section 6 presents our conclusions and future work. 


2 Foundations 


This section reviews the concepts of class invariant in the context of the object- 
oriented paradigm by means of a running example. We subsequently outline how 
property-based and bounded-exhaustive testing relate to class invariants. 

public class SimpleSquare { 
  //@ invariant w == h && w > 0; 
  private int w, h; // width and height 

  public SimpleSquare() { setLength(1); } 
  public void setLength(int length) { 
    if (length <= 0) { throw new IllegalArgumentException(); } 
    this.w = length; 
    this.h = length; 
  } 

  public int area() { return w*h; } 
  public int perimeter() { return 2*(w+h); } 
  public int aspectRatio() { return w/h; } 

  public SimpleRectangle toRect() { 
    return new SimpleRectangle(w, h); 
  } 
} 


Fig. 1: Running example Java class SimpleSquare. 


Running Example The class SimpleSquare in Figure 1 models a square with a 
non-zero positive length using the two integer attributes width (w) and height (h). 
Other objects can interact with SimpleSquare by invoking its public methods to 
set the length of the square or to compute its geometric properties, or to obtain 
an equivalent object of class SimpleRectangle. Note that method setLength 
performs thorough input validation and throws an IllegalArgumentException 
if the provided method argument value is not strictly positive. 


Class Invariants Objects play a fundamental role in object-oriented program- 
ming. They are created via constructors, interact with other objects via method 
calls, and are disposed by a destructor. Throughout method execution, an ob- 
ject may call methods of other objects, including itself, or alter the accessible 
attributes of other objects. Often, invoking a method results in a side-effect 
or modification of the object’s state, either through modifying its primitive at- 
tributes or by modifying the object state of a referenced object. 

The notion of a class invariant in object-oriented programming has first 
been explored in and since been adapted by specification languages such 
as JML [16]. Understanding class invariants is crucial during development and 
maintenance, because they provide guarantees about the object state at the start 
of a qualified method call [20] and the end of such a call. In contrast, the class 
invariant may not hold for unqualified method calls, which the object invokes on 
itself. For example, calling setLength in the constructor is considered unqualified. 
Accordingly, the class invariant holds for all objects derived via a constructor or 
via a qualified call invoked on an object that satisfies the invariant. 

@Test public void traditionalTest() { 
  SimpleSquare s = new SimpleSquare(); 
  s.setLength(5); 
  assert s.area() == 25; 
} 

@Test public void propertyBasedTest(SimpleSquare s) { 
  assert s.toRect().area() == s.area(); 
  s.toRect().toSquare(); // implicitly checks absence of exception 
} 


Fig. 2: A traditional and a property-based test for class SimpleSquare. 


In the running example, the assertion that the width and height are equal and 
strictly positive is a suitable class invariant. Accordingly, method aspectRatio 
does not need to check that attribute h is non-zero to avoid a division-by-zero 
exception, because this is implied by the invariant. Similarly, method toRect 
can assume that constructing a new SimpleRectangle object always succeeds. 

The set of reachable objects that a class invariant has to satisfy can be 
constructed incrementally by performing random walks in the object state space. 
A random walk starts at an object state derived from a constructor and continues 
by invoking methods on the current object; this kind of state exploration is used 
in the context of fuzz testing and test suite generation. Even for finite 
object state spaces, an exhaustive exploration is often practically infeasible. 
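
A random walk over the state space of the running example can be sketched in a few lines. The following Python port of SimpleSquare is an illustrative assumption (the approach itself targets Java); random_walk, its argument range, and the seed are hypothetical choices. Every state the walk visits is reachable and therefore counts as valid.

```python
import random

class SimpleSquare:
    """Python port of the running example, reduced to its state-changing method."""
    def __init__(self):
        self.set_length(1)

    def set_length(self, length):
        if length <= 0:
            raise ValueError("length must be positive")
        self.w = length
        self.h = length

def random_walk(steps, seed=0):
    """Collect the (w, h) states reached by calling set_length with random arguments."""
    rng = random.Random(seed)
    obj = SimpleSquare()
    states = {(obj.w, obj.h)}              # state derived from the constructor
    for _ in range(steps):
        try:
            obj.set_length(rng.randint(-3, 5))
        except ValueError:
            pass                           # rejected argument, state unchanged
        states.add((obj.w, obj.h))
    return states

reachable = random_walk(200)
```

All collected states satisfy w == h and w > 0, regardless of which random arguments were drawn, because invalid arguments are rejected before the attributes change.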


Property-Based Testing While traditional tests first establish a testing scenario, 
property-based tests [9] are parameterized over inputs supplied by a test engine. 
Property-based testing is primarily used in functional languages, e.g., in Haskell 
using QuickCheck [5], but can also be applied to object-oriented programs. 

Figure 2 depicts a traditional and a property-based test for our running 
example. Note that the property-based test is parameterized over an object of 
the class under test and checks that the obtained rectangle has the same area 
as the former square. It also implicitly tests that the translation from rectangle 
to square via method toSquare does not raise an exception. 
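
The idea of a test parameterized over engine-supplied inputs can be shown without any testing library. The sketch below is a hand-rolled miniature, not QuickCheck or an existing Java engine: check plays the role of the test engine, and the property, the constructor taking a length, and the value range are assumptions made for illustration.

```python
import random

class SimpleSquare:
    """Python port of the running example; the length parameter is an assumption."""
    def __init__(self, length=1):
        if length <= 0:
            raise ValueError("length must be positive")
        self.w = self.h = length

    def area(self):
        return self.w * self.h

    def perimeter(self):
        return 2 * (self.w + self.h)

def prop_area_and_perimeter(s):
    """Property parameterized over a square supplied by the engine."""
    return s.area() == s.w ** 2 and s.perimeter() == 4 * s.w

def check(prop, runs=100, seed=0):
    """Tiny test engine: feed randomly generated valid squares to prop."""
    rng = random.Random(seed)
    return all(prop(SimpleSquare(rng.randint(1, 1000))) for _ in range(runs))

passed = check(prop_area_and_perimeter)
```

Unlike the traditional test, the property never fixes a concrete length; it only states a relation that must hold for every square the engine provides.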


Bounded-Exhaustive Testing Deriving a representative set of objects, e.g., for 
property-based testing, is often a tedious and error-prone task when done 
manually. Bounded-exhaustive testing [6,11,21] is a testing technique that 
automatically tests software for all valid inputs within specified size bounds. 

While primitive types like integers are often sampled from a range of values, 
complex object states usually require a create-and-test approach: a systematic 
enumeration artificially assigns values to private and public attributes to create 
all object states within a provided bound, and a manually specified predicate, 
i.e., a class invariant, tests for validity and retains valid objects only. 
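
The create-and-test scheme can be sketched for the running example as follows. The Python code enumerates every assignment of the attributes w and h within a bound and retains those accepted by a manually specified predicate; representing object states as (w, h) tuples and the bound of 2 are assumptions made for illustration.

```python
from itertools import product

def invariant(w, h):
    """Manually specified validity predicate (the documented class invariant)."""
    return w == h and w > 0

def bounded_states(bound):
    """Enumerate every (w, h) assignment with both values in 0..bound."""
    return list(product(range(bound + 1), repeat=2))

candidates = bounded_states(2)                           # create
valid = [s for s in candidates if invariant(*s)]         # test and retain
```

Of the nine candidate states within the bound, only (1, 1) and (2, 2) are retained, which shows why the scheme depends on a correct predicate being available.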


Fig. 3: Overview of our approach to dynamic class invariant learning. 


3 Approach 


This section introduces our approach to dynamic class invariant learning, which 
is depicted in Figure 3. Each step either modifies the set of collected valid (O) 
or invalid (Ō) objects, or the set of assertions (A) whose conjunction forms the 
candidate class invariant (I). If an object is reachable, we consider it valid. If an 
object is unreachable, we consider it invalid. The class invariant we aim to learn 
classifies all reachable objects as valid and all unreachable objects as invalid. 

The weakening step aims to refine the candidate class invariant I by finding 
a valid object o that is classified as invalid by I. If successful, we remove the 
conflicting, overly restrictive assertion(s) that caused the incorrect classification. 
Previously collected invalid objects that are no longer classified as invalid due 
to the removed assertions are reintegrated subsequently. If no valid object is 
misclassified, we perform strengthening to find an invalid object ō that is 
misclassified. The invalid object integration step then derives a matching assertion 
that correctly classifies an invalid object as invalid but all previously found valid 
objects still as valid. If no ō is found, we return the candidate class invariant. 

Because our approach learns from a finite set of objects, the learned class 
invariant is only correct for the collected (in)valid objects, but not in general. 
Moreover, if no assertion can be generated to distinguish a valid from an invalid 
object, the learned invariant correctly classifies all identified valid objects, 
but mistakenly classifies some invalid objects as valid. 

The high-level weakening, strengthening, and invalid object integration steps 
are generic and can be instantiated by different techniques. Our approach lever- 
ages random walks to generate valid objects and combines bounded-exhaustive 
testing techniques with behavioral oracles to obtain invalid objects. We derive 
assertions to distinguish valid from invalid objects using a grammar. In contrast 
to related approaches [25,30], our objects are guaranteed to be (in)valid. 

Table 1: Intermediate states of our approach to class invariant learning in each 
iteration, for the SimpleSquare running example. 


it.  current assertions  found      removed assertions  new assertions 
     A                   o/ō        A_del               A_new 

1    ∅                   ō: [0|0]   ∅                   {false} 
2    {false}             o: [1|1]   {false}             {w = 1} 
3    {w = 1}             ō: [1|0]   ∅                   {w = h} 
4    {w = 1, w = h}      o: [2|2]   {w = 1}             {w > 0} 
5    {w = h, w > 0}      ⊥ 


Table 1 shows the execution state of our approach in each iteration when 
learning class invariant w = h ∧ w > 0 for our running example SimpleSquare. 
Valid objects, marked o, such as [1|1] are drawn in a solid box, while invalid 
objects, marked ō, such as [0|0] are drawn in a dashed box. The remainder of this 
section uses this example to illustrate the workings of our invariant learning 
approach. 
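
The iteration scheme can be replayed on the running example with a small Python sketch. It is a deliberately simplified stand-in for the approach: objects are (w, h) tuples handed over in a fixed order instead of being found by random walks and oracles, and the assertion grammar is replaced by a fixed, ordered list of four candidate assertions. The names GRAMMAR, integrate, and learn, as well as the observation sequence, are assumptions chosen to mirror the running example.

```python
GRAMMAR = [                       # ordered candidate assertions (an assumption)
    ("false",  lambda w, h: False),
    ("w == 1", lambda w, h: w == 1),
    ("w == h", lambda w, h: w == h),
    ("w > 0",  lambda w, h: w > 0),
]
PRED = dict(GRAMMAR)

def accepts(assertions, obj):
    """True if the conjunction of the assertions classifies obj as valid."""
    return all(PRED[a](*obj) for a in assertions)

def integrate(assertions, valid, bad_obj):
    """Add the first assertion that rejects bad_obj but keeps every valid object."""
    for name, pred in GRAMMAR:
        if not pred(*bad_obj) and all(pred(*v) for v in valid):
            assertions.append(name)
            return
    # No separating assertion exists: bad_obj stays misclassified as valid.

def learn(observations):
    assertions, valid, invalid = [], [], []
    for kind, obj in observations:
        if kind == "valid":
            valid.append(obj)
            # Weakening: drop assertions that reject the new valid object.
            assertions = [a for a in assertions if PRED[a](*obj)]
        else:
            invalid.append(obj)
        # Strengthening: (re)integrate invalid objects that are accepted.
        for bad in invalid:
            if accepts(assertions, bad):
                integrate(assertions, valid, bad)
    return assertions

observations = [("invalid", (0, 0)), ("valid", (1, 1)),
                ("invalid", (1, 0)), ("valid", (2, 2))]
learned = learn(observations)
```

With this observation order the candidate invariant evolves from false over w == 1 and w == 1, w == h to the final conjunction of w == h and w > 0.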


3.1 A Triangle of Oracles 


Our approach exploits the insight that an executable implementation, a testable 
assumption, and an object form a closed loop of information. Assuming that two
of these elements are correct allows constructing a test-based oracle to assess the
correctness of the third. This leads to the creation of three distinct oracles:


1. Implementation: Given a correct assumption and a valid object, any failure 
upon testing the assumption indicates a faulty implementation. 

2. Assumption: Given a correct implementation and a valid object, any failure 
upon testing the assumption indicates an incorrect assumption. 

3. Object: Given a correct implementation and a correct assumption, any failure 
upon testing the assumption indicates an invalid object. 


The implementation oracle is leveraged in software testing to detect faulty 
implementations. It either encodes assumptions as traditional tests, which create 
objects assumed to be valid by construction and check assertions on them, or
as property-based tests, which evaluate properties on valid objects supplied by 
the test engine. When learning a class invariant for a given implementation, one 
can ignore the question of implementation correctness, because the invariant is 
supposed to reflect the implementation. However, a learned invariant that does 
not match the expectations may indicate a faulty implementation. 

The assumption oracle can be employed to identify an incorrect invariant 
that misclassifies valid objects as invalid when considering the invariant as the 
assumption. By generating valid objects in our weakening step, we detect an 
overly restrictive, i.e., unsound, invariant. Analogously, the second oracle can
be used to identify invariants that misclassify invalid objects as valid. If an
object is invalid, but the candidate invariant holds, the invariant is incomplete,
which allows our strengthening step to detect overly permissive invariants. We 
consider an invariant/oracle sound if it classifies all valid objects as valid, and 
complete if it classifies all invalid objects as invalid. The objects revealing an 
incorrect candidate class invariant are added to the training set during weaken- 
ing/strengthening, and the invariant is updated accordingly. 

The object oracle can detect invalid objects if implementation and assump- 
tion are correct. Invalid objects can be used by the assumption oracle to spot 
overly permissive invariants. Providing assumptions to detect both valid and 
invalid objects is challenging and equivalent to learning the class invariant. 


3.2 Generating Valid Object States via Random Walks 


The weakening step leverages the assumption oracle to assess whether the can- 
didate class invariant misclassifies valid objects as invalid. To construct valid 
objects, we perform random walks in object state spaces: any object derived via 
a sequence of qualified method calls starting from a freshly constructed object is 
valid. Because the implementation can be considered correct, a method invoca- 
tion in a random walk may only throw expected exceptions, which are associated 
with a failed input validation such as the IllegalArgumentException thrown 
by method setLength. In contrast, unexpected exceptions are prevented by the 
class invariant. For example, a division-by-zero exception cannot be thrown in 
method aspectRatio, because the invariant guarantees that the height is non- 
zero. In practice, all checked exceptions in Java are typically expected exceptions 
and some unchecked exceptions are unexpected exceptions. 

We parameterize the random walks using a set of builders and actions. 
Builders construct fresh objects using the available constructors, and actions 
invoke methods. Following a common naming convention for methods, we use
the term observer/modifier action to denote an action that does not/does alter 
the considered object’s state. In our example, a single builder invoking the zero- 
argument constructor and a single action invoking method setLength with value 
2 suffice. To enforce termination, we bound the random walk with respect to the 
number of walks and the number of method calls per walk. To ensure deter- 
ministic behavior, one may either randomly select a builder/action using a fixed 
seed (like Randoop [26]) or exhaustively explore all builder/action combinations 
up to a given depth (like EvoSpex [25]). Because the walks are bounded, not finding a valid object that is
misclassified as invalid by the candidate class invariant does not guarantee the 
absence of one. The effectiveness of finding a misclassified object depends on the 
object state coverage achieved by the random walk. 
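
The bounded, seeded walk described above can be sketched as follows. This is a
minimal illustration under stated assumptions, not the Geminus implementation:
the class SimpleSquare is a hypothetical reconstruction of the running example
(a zero-argument constructor yielding width and height 1, and a setLength method
that rejects non-positive values), and the names walk, builders, and actions are
chosen here for illustration only.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Supplier;

class RandomWalkSketch {
    /** Hypothetical reconstruction of the running example; not the paper's code. */
    static final class SimpleSquare {
        private int w = 1, h = 1;                 // fresh objects start in state [1, 1]
        void setLength(int len) {                 // modifier action with input validation
            if (len <= 0) throw new IllegalArgumentException("length must be positive");
            w = len;
            h = len;
        }
        int aspectRatio() { return w / h; }       // observer action
        String state() { return "[" + w + ", " + h + "]"; }
    }

    /** Builders construct fresh objects via the available constructors. */
    static List<Supplier<SimpleSquare>> builders() {
        return List.of(SimpleSquare::new);
    }

    /** Actions invoke methods; setLength(0) triggers an expected exception. */
    static List<Consumer<SimpleSquare>> actions() {
        return List.of(s -> s.setLength(2), s -> s.setLength(0), SimpleSquare::aspectRatio);
    }

    /** Collects every state visited by `walks` walks of `length` calls each. */
    static Set<String> walk(List<Supplier<SimpleSquare>> builders,
                            List<Consumer<SimpleSquare>> actions,
                            int walks, int length, long seed) {
        Random rnd = new Random(seed);            // fixed seed makes the walk repeatable
        Set<String> valid = new LinkedHashSet<>();
        for (int i = 0; i < walks; i++) {
            SimpleSquare s = builders.get(rnd.nextInt(builders.size())).get();
            valid.add(s.state());
            for (int j = 0; j < length; j++) {
                try {
                    actions.get(rnd.nextInt(actions.size())).accept(s);
                } catch (IllegalArgumentException expected) {
                    // failed input validation: the state is unchanged, the walk continues
                }
                valid.add(s.state());
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        System.out.println(walk(builders(), actions(), 5, 10, 42L));
    }
}
```

Every state returned by walk is reachable through qualified calls and is
therefore valid by construction. With the actions above, only [1, 1] and [2, 2]
can ever be visited, which also shows why coverage depends on the chosen
builders and actions.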

The candidate invariant before the second iteration (false) in Table 1
misclassifies [1, 1] obtained directly from the constructor. In contrast, the
invariant at the start of the fourth iteration (w = 1 ∧ w = h) misclassifies
[2, 2], which is obtained after invoking setLength(2) on the freshly constructed
object. No valid object is misclassified as invalid for the invariant at the
start of the fifth iteration (w = h ∧ w > 0). Hence, this invariant is sound
and, as we will see later, it is also complete.

Table 2: Accuracy of properties for detecting artificially created invalid
SimpleSquare objects (● detected, ○ undetected).

invoked method/tested property   (0, 0)   (1, 0)   (-1, -1)   (1, 2)   (3, 2)
aspectRatio()                    ●        ●        ○          ○        ○
toRect()                         ●        ●        ●          ○        ○
area() > 0                       ●        ●        ○          ○        ○
perimeter() > 0                  ●        ○        ●          ○        ○
aspectRatio() == 1               ●        ●        ○          ●        ○
toRect().toSquare()              ●        ●        ●          ●        ●


3.3 Detecting Invalid Objects via Behavioral Oracles 


An object is considered invalid if it cannot be reached via a random walk. How- 
ever, exhaustive state space exploration is impossible for infinite state spaces 
which occur, e.g., when objects use references to establish unbounded structures 
such as linked lists. Even for finite state spaces as exhibited by the running ex- 
ample, an exhaustive exploration often remains practically infeasible. In general, 
a partial exploration does not provide a sound oracle to determine if a supplied 
object is unreachable. To detect invalid objects, we instead consider behavioral 
oracles that exploit the behavior of the object under analysis exposed upon 
method invocations. We consider two sound but possibly incomplete behavioral 
oracles for detecting invalid objects: random walks and property-based tests. 


Random Walks as Weak Oracles During the random walks used to generate 
valid objects, any thrown expected exception indicates a failed input validation 
and is ignored. Conversely, if an unexpected exception occurs during a walk 
starting from an artificially created object, it implies that all objects along the 
walk, including the initial object, are invalid. The use of random walks for
detecting invalid objects shares similarities with fuzz testing [17] for identifying
faulty implementations. In fuzz testing, a program is subjected to a range of 
different input values to cause an observable error [88], indicating a bug in the 
implementation. For a correct implementation, any unexpected exception
indicates an invalid object. While behavioral oracles based on random walks
are sound by construction for detecting invalid objects, they are rarely complete.

Table 2 shows the detection results of six properties for five invalid objects.
The first two properties resemble observer actions during a random walk. Method
aspectRatio throws a division-by-zero exception if the height is zero, thus
detecting the first two invalid objects. Method toRect creates a new rectangle
with the same width and height as the current square. The constructor of class
SimpleRectangle (not shown) validates the input width and height and throws
an exception if argument values are not strictly positive, thus subsuming the
aspectRatio method in terms of its detection capabilities. However, it fails to
detect objects whose strictly positive width and height differ.


Property-based Tests as Strong Oracles Property-based tests [9] are a stronger 
behavioral oracle when compared to random walks. Not only can they detect 
invalid objects that throw unexpected exceptions, but they can also interpret 
the absence of an exception and method return values as an indication of object 
invalidity. Because property-based tests operate at a behavioral level, they do not 
require knowledge about internal implementation details. Information regarding 
expected behavior can be found in the documentation of the class under analysis 
and (formal) specifications, e.g., for abstract data types [12]. Because property- 
based tests are assumed to be sound but incomplete, a passing property-based 
test suite does not guarantee the validity of the object under analysis. However, 
a single failed test is sufficient to deem the object invalid. 

The last four properties in Table 2 resemble candidate property-based tests.
We may assume that the expected behavior of class SimpleSquare is that the 
area and the perimeter must be greater than zero and that the aspect ratio 
must be equal to one. In addition, the translation from a square to a rectangle 
and back to a square should be possible without raising an exception. Observe 
that the area property detects invalid objects with either the width, height or 
both equal to zero. The perimeter property detects those invalid objects where 
the sum of width and height is not strictly positive. Note that the aspect ratio
property detects not only the objects caught by its corresponding observer action
but also some states where w and h differ, namely those whose integer division
w/h does not yield one. The last property subsumes its
associated observer action and detects all invalid objects. 
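
The detection pattern of Table 2 can be replayed with a small sketch. The
classes below are hypothetical reconstructions: the text does not show
SimpleSquare or SimpleRectangle, so the method bodies (area as w * h, perimeter
as 2 * (w + h), integer division in aspectRatio, and the validation in the
SimpleRectangle constructor and in toSquare) are assumptions derived from the
prose. An oracle is taken to detect an object if the check throws or evaluates
to false.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

class BehavioralOracleSketch {
    /** Hypothetical square whose constructor skips validation, to inject invalid states. */
    static final class SimpleSquare {
        final int w, h;
        SimpleSquare(int w, int h) { this.w = w; this.h = h; }
        int aspectRatio() { return w / h; }          // ArithmeticException if h == 0
        int area() { return w * h; }
        int perimeter() { return 2 * (w + h); }
        SimpleRectangle toRect() { return new SimpleRectangle(w, h); }
    }

    /** Hypothetical rectangle that validates its input, as described in the text. */
    static final class SimpleRectangle {
        final int w, h;
        SimpleRectangle(int w, int h) {
            if (w <= 0 || h <= 0) throw new IllegalArgumentException("sides must be positive");
            this.w = w;
            this.h = h;
        }
        SimpleSquare toSquare() {
            if (w != h) throw new IllegalStateException("not a square");
            return new SimpleSquare(w, h);
        }
    }

    /** The five invalid objects of Table 2 as (w, h) pairs. */
    static final int[][] INVALID = {{0, 0}, {1, 0}, {-1, -1}, {1, 2}, {3, 2}};

    /** Two observer actions followed by four property-based tests. */
    static Map<String, Predicate<SimpleSquare>> oracles() {
        Map<String, Predicate<SimpleSquare>> m = new LinkedHashMap<>();
        m.put("aspectRatio()", s -> { s.aspectRatio(); return true; });
        m.put("toRect()", s -> { s.toRect(); return true; });
        m.put("area() > 0", s -> s.area() > 0);
        m.put("perimeter() > 0", s -> s.perimeter() > 0);
        m.put("aspectRatio() == 1", s -> s.aspectRatio() == 1);
        m.put("toRect().toSquare()", s -> { s.toRect().toSquare(); return true; });
        return m;
    }

    /** An object is detected if the check throws or the property is violated. */
    static boolean detects(Predicate<SimpleSquare> oracle, SimpleSquare s) {
        try {
            return !oracle.test(s);
        } catch (RuntimeException e) {
            return true;
        }
    }

    /** One table row: '1' for detected, '0' for undetected, in column order. */
    static String row(String name) {
        StringBuilder sb = new StringBuilder();
        for (int[] o : INVALID)
            sb.append(detects(oracles().get(name), new SimpleSquare(o[0], o[1])) ? '1' : '0');
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String name : oracles().keySet()) System.out.println(row(name) + "  " + name);
    }
}
```

Under these assumptions, the six rows come out as 11000, 11100, 11000, 10100,
11010, and 11111, matching the detected/undetected pattern of Table 2.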


3.4 Generating Invalid Objects via Bounded-Exhaustive Testing Techniques


By considering invalid objects, we can not only check if the invariant is
complete, i.e., sufficiently restrictive, but also automatically identify
equivalent assertions [1, 28]. While misclassified valid objects found during
weakening widen the scope, misclassified invalid objects found during
strengthening narrow it. Acquiring a representative set of invalid objects is a
non-trivial task. Existing assertion learning approaches primarily derive
possibly invalid objects by executing a mutated program [15, 23, 30] or by
mutating valid program states [25, 29]. Nevertheless, these approaches often
assume the derived object state to be invalid without conducting further
validation. Consequently, the quality of the learned assertion is compromised
if a valid object state is mistakenly labeled as invalid. Using generators for
complex test inputs from bounded-exhaustive testing (BET), such as Korat
[3, 21], enables the artificial creation of a large number of (in)valid object
states. We combine these generators with behavioral oracles, and contrary to
the conventional practice in BET of retaining only valid objects, we retain
only those objects that are classified as invalid. Behavioral oracles can also
be applied to objects constructed using program or state mutation; however,
we favor the complex test input generators from BET because they produce a
larger and more representative set of invalid objects.

The five invalid object states displayed in Table 2 are included in the output of
a bounded-exhaustive object state generator when supplied with a lower/upper
bound of -1/3 on integer values. The invalid objects (0, 0) and (1, 0) are suitable
for strengthening the candidate invariant.
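
A bounded-exhaustive generator for the two-integer state of SimpleSquare,
combined with a behavioral oracle, can be sketched as below. This is a
simplification made for illustration: states are plain (w, h) pairs rather than
objects modified via reflection, and the two oracles are written directly as
predicates that mirror what the observer actions and the toRect().toSquare()
property would detect. The ground truth w = h ∧ w > 0 is used only to count
undetected invalid states.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class BoundedExhaustiveSketch {
    /** Enumerates every state {w, h} with lo <= w, h <= hi. */
    static List<int[]> enumerate(int lo, int hi) {
        List<int[]> states = new ArrayList<>();
        for (int w = lo; w <= hi; w++)
            for (int h = lo; h <= hi; h++)
                states.add(new int[] {w, h});
        return states;
    }

    /** Ground-truth invariant of the running example: w = h and w > 0. */
    static boolean groundTruth(int[] s) { return s[0] == s[1] && s[0] > 0; }

    /** Models the observer actions: aspectRatio() and toRect() fail on non-positive sides. */
    static boolean walkOracle(int[] s) { return s[0] <= 0 || s[1] <= 0; }

    /** Models toRect().toSquare(): additionally fails when the sides differ. */
    static boolean propertyOracle(int[] s) { return walkOracle(s) || s[0] != s[1]; }

    /** Contrary to classic BET, keep only the states the oracle classifies as invalid. */
    static List<int[]> retainInvalid(List<int[]> states, Predicate<int[]> oracle) {
        List<int[]> invalid = new ArrayList<>();
        for (int[] s : states) if (oracle.test(s)) invalid.add(s);
        return invalid;
    }

    /** Invalid states that the oracle fails to detect. */
    static long falseNegatives(List<int[]> states, Predicate<int[]> oracle) {
        return states.stream().filter(s -> !groundTruth(s) && !oracle.test(s)).count();
    }

    /** Returns {valid, invalid, FN of walk oracle, FN of property oracle}. */
    static long[] summary(int lo, int hi) {
        List<int[]> all = enumerate(lo, hi);
        long valid = all.stream().filter(BoundedExhaustiveSketch::groundTruth).count();
        return new long[] {valid, all.size() - valid,
            falseNegatives(all, BoundedExhaustiveSketch::walkOracle),
            falseNegatives(all, BoundedExhaustiveSketch::propertyOracle)};
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(summary(-1, 3)));
        System.out.println(java.util.Arrays.toString(summary(-10, 10)));
    }
}
```

For the bounds -1/3, the sketch enumerates 25 states, of which 3 are valid and
22 invalid; the walk-style oracle misses 6 invalid states and the property-style
oracle misses none. If one assumes bounds of -10/10 (an inference made here, not
stated in the text), the counts are 10 valid, 431 invalid, 90 and 0 false
negatives, which coincides with the SimpleSquare row of Table 3.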


3.5 Invalid Object Integration 


Our approach generates new assertions on-the-fly in order to integrate so far 
misclassified invalid objects and classify them correctly. Each assertion is evalu- 
ated in the context of an object of the class under study. The following assertion 
grammar suffices for our running example: 


    Int  ::=  0 | 1
    Bool ::=  true | false | Int = Int | Int > Int
    Int  ::=+ w | h

The first two rule fragments reason about integer and boolean values, while
the last fragment provides access to the attributes of a SimpleSquare object.
Terminals such as “1” or “>” denote constants or operators, and non-terminals
such as Int are types. Symbol ::=+ indicates that we supplement a non-terminal
with new rules.

The invalid object integration step is performed after strengthening or weak- 
ening. In the former case, a single invalid object is provided, while in the latter 
case there may be multiple or no invalid objects. In case of a single misclassified 
invalid object, we search for an assertion that classifies the said object as invalid, 
but does not classify any previously collected valid object as invalid. For multiple 
invalid objects, we iteratively search for a suitable assertion. 
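
The search for a suitable assertion can be illustrated with a small enumerator
over the grammar of the running example. The sketch below is an assumption-laden
simplification: states are (w, h) pairs, assertions are enumerated in one fixed
order chosen here (true, false, then all equalities, then all comparisons, with
integer terms ordered w, h, 0, 1), and the first assertion that holds on all
valid objects and rejects the invalid object is returned. The actual enumeration
order used by the tool is not given in the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.function.ToIntFunction;

class IntegrationSketch {
    /** A printable assertion over an object state s = {w, h}. */
    static final class Assertion {
        final String text;
        final Predicate<int[]> holds;
        Assertion(String text, Predicate<int[]> holds) { this.text = text; this.holds = holds; }
    }

    /** Enumerates Bool ::= true | false | Int = Int | Int > Int with Int ::= w | h | 0 | 1. */
    static List<Assertion> enumerate() {
        String[] names = {"w", "h", "0", "1"};
        List<ToIntFunction<int[]>> terms = List.of(s -> s[0], s -> s[1], s -> 0, s -> 1);
        List<Assertion> out = new ArrayList<>();
        out.add(new Assertion("true", s -> true));
        out.add(new Assertion("false", s -> false));
        for (String op : new String[] {"=", ">"}) {
            boolean eq = op.equals("=");
            for (int i = 0; i < names.length; i++) {
                for (int j = 0; j < names.length; j++) {
                    ToIntFunction<int[]> l = terms.get(i), r = terms.get(j);
                    out.add(new Assertion(names[i] + " " + op + " " + names[j],
                        s -> eq ? l.applyAsInt(s) == r.applyAsInt(s)
                                : l.applyAsInt(s) > r.applyAsInt(s)));
                }
            }
        }
        return out;
    }

    /** First assertion that holds on every valid object and rejects the invalid one. */
    static Optional<String> integrate(List<int[]> valid, int[] invalid) {
        for (Assertion a : enumerate())
            if (!a.holds.test(invalid) && valid.stream().allMatch(a.holds))
                return Optional.of(a.text);
        return Optional.empty();   // the invalid object is indistinguishable in this grammar
    }

    public static void main(String[] args) {
        int[] v11 = {1, 1}, v22 = {2, 2};
        System.out.println(integrate(List.of(v11), new int[] {0, 0}));
        System.out.println(integrate(List.of(v11), new int[] {1, 0}));
        System.out.println(integrate(List.of(v11, v22), new int[] {0, 0}));
    }
}
```

With this particular enumeration order, the three calls in main yield w = 1,
w = h, and w > 0, the same assertions that appear in the "new assertions"
column of Table 1 for iterations 2 to 4.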

Our invalid object integration step can be substituted with any model learn- 
ing approach that accepts valid and invalid object states as input. While neural 
networks [24] and support vector machines [80] generally achieve high accuracy, 
their black-box nature makes them less ideal for program comprehension. In
contrast, decision tree models [2] offer interpretability, but their internal
disjunctive encoding differs from how developers express class invariants in
code, usually as a sequence of assert statements. Hence, we favor conjunctive models for
modeling class invariants in the context of comprehending object states, because 
they are interpretable and align with how invariants are phrased in practice. 


Caching Suitable Assertions An unsuitable assertion either incorrectly detects 
a valid object or does not detect the candidate invalid object. Because our ap- 
proach only adds objects and never removes existing ones, an assertion that 
incorrectly detects a valid object is unsuitable not only for integrating the
currently misclassified invalid object but also for any future one. In contrast, an
assertion that satisfies all valid objects and the misclassified invalid object may
still be suitable in the future.

Fig. 4: The behavioral oracle aspectRatio() and the assertion w = h both
detect the invalid object (1, 0), but classify other objects differently.

Our caching mechanism only stores assertions that satisfy all valid objects. 
For example, after observing [1, 1] we store the assertion true in the cache, but
we do not store false. 


Preventing Equivalent Assertions Our approach only adds assertions to distin- 
guish invalid from valid objects, which prevents the generation of equivalent 
assertions. This strategy exploits observational equivalence [1, 28], which creates
equivalence partitions among assertions based on the values to which they eval- 
uate. Because our approach only adds an assertion if the existing assertions 
cannot distinguish an invalid object from the valid objects, the added assertion 
is observationally inequivalent to any existing assertion. This property remains 
true because we only add (in)valid objects, thus refining this notion of
equivalence. For example, false and w = 1 are considered to be equivalent with respect
to (0, 0), but are inequivalent when also considering [1, 1].
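
The example can be made concrete with a short sketch in which the observational
signature of an assertion is the sequence of truth values it yields on the
collected objects. The representation of states as (w, h) pairs and the names
signature and equivalent are illustrative choices made here, not taken from the
tool.

```java
import java.util.List;
import java.util.function.Predicate;

class ObservationalEquivalenceSketch {
    static final Predicate<int[]> FALSE = s -> false;
    static final Predicate<int[]> W_EQ_1 = s -> s[0] == 1;   // the assertion w = 1

    /** Truth values of an assertion on the collected objects, e.g. "FT". */
    static String signature(Predicate<int[]> assertion, List<int[]> objects) {
        StringBuilder sb = new StringBuilder();
        for (int[] o : objects) sb.append(assertion.test(o) ? 'T' : 'F');
        return sb.toString();
    }

    /** Two assertions are observationally equivalent if their signatures coincide. */
    static boolean equivalent(Predicate<int[]> a, Predicate<int[]> b, List<int[]> objects) {
        return signature(a, objects).equals(signature(b, objects));
    }

    public static void main(String[] args) {
        int[] invalid00 = {0, 0};
        int[] valid11 = {1, 1};
        System.out.println(equivalent(FALSE, W_EQ_1, List.of(invalid00)));
        System.out.println(equivalent(FALSE, W_EQ_1, List.of(invalid00, valid11)));
    }
}
```

Adding the valid object [1, 1] splits the partition that false and w = 1 shared,
which is the refinement of equivalence described above.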

Observational equivalence cannot be used for approaches that only consider 
valid objects [8, 27, 34], because all suitable assertions are deemed equivalent.
Instead, these approaches require static analysis to detect equivalent assertions. 


Inexpressive Assertion Grammars If the assertion grammar for the example in
Figure 4 were only capable of generating the assertion w = h, then the
invalid object (0, 0) could not be integrated. This invalid object is said to be
indistinguishable from the valid objects such as [1, 1] with respect to the employed
assertion grammar. Because our collected objects are proven (in)valid,
indistinguishability can only be resolved by increasing the grammar’s expressiveness.
Instead, we continue learning but label the class invariant as approximate, which 
ensures that it is overly permissive and, thus, remains sound. Note that once 
the candidate class invariant becomes approximate, it remains so. However, an 
overly permissive invariant is still useful for program comprehension, because a 
subsequent manual invariant refinement only needs to add assertions. 


Outperforming the Behavioral Oracle Our approach does not learn an invariant
from a single complete oracle, but instead utilizes two sources of sound
information: behavioral oracles for invalid objects and random walks for valid
objects. This can result in invariants that improve upon the accuracy of the
underlying behavioral oracle. For example, the oracle aspectRatio() in Figure 4
detects the invalid object (1, 0), which can be integrated by adding the assertion w = h to the
candidate class invariant. Note that this assertion also detects the invalid object 
(1, -1), which is not detected by the oracle.


Qualities of Learned Class Invariants The quality of our learned class invariants 
depends on the expressiveness of the assertion grammar, the accuracy of the be- 
havioral oracle, and the object state coverage achieved by the random walk for 
generating valid objects and the bounded-exhaustive object state generator for 
generating potential invalid objects. While an inexpressive assertion grammar 
may be detected during learning, an incomplete oracle or an insufficient ob- 
ject state coverage cannot be detected. Accordingly, no soundness/completeness 
guarantees can be given for a learned non-approximate class invariant except 
that it correctly classifies all collected (in)valid objects. Approximate class in- 
variants classify some of the collected invalid objects as valid, which still aids 
comprehension in the presence of an inexpressive assertion grammar. 

Learning a complete invariant that also correctly classifies so far unseen ob- 
jects is only possible if the assertion grammar is sufficiently expressive, the 
behavioral oracle is complete, and the object state coverage is sufficient, e.g., 
exhaustive for finite object state spaces. 


4 Evaluation 


To evaluate our class invariant learning approach, we have implemented the 
prototype tool Geminus for Java. Our bounded-exhaustive object state generator 
uses the Java Reflection API to modify the internal object state and prevents the 
generation of symmetric object states in the style of [21]. Our grammar-based
assertion generator performs an explicit top-down enumeration and generates 
strings representing native Java expressions, which allows for a simple grammar 
definition. We use the Java JShell to dynamically compile these strings into 
executable lambda expressions at runtime. 
Our experiments focus on the following research questions: 


RQ1 How do random walks and property-based tests compare to a ground-truth 
class invariant in terms of detecting invalid objects? 

RQ2 What is the disparity between the class invariant learned by Geminus and 
the employed behavioral oracle? 

RQ3 How does the accuracy of the class invariant(s) learned by Geminus, de- 
tected by Daikon, and documented as invariant validation methods differ? 


4.1 Benchmark Composition 


Our benchmark contains several dynamic data structures, whose implementations
exhibit complex invariants. In addition, the corresponding classes are among
the few in the Java collections framework that contain state validation methods.
From the evaluation examples of Daikon [8], we pick StackAr and QueueAr,
which were adapted from [87] and provide an array-based implementation of 
a stack and queue, respectively. The majority of our dynamic data structures 
originate from the Java collections framework java.util. Class ArrayList and 
legacy class Vector both provide a linear collection via an array-based implemen- 
tation. In addition, class LinkedList provides Deque/Queue functionalities via a 
linkage-based implementation, while class ArrayDeque uses an array-based im- 
plementation. Class PriorityQueue handles comparable elements via an array- 
based priority heap, and class BitSet offers a memory-efficient bit vector. 

For verification, a class invariant needs to be strong enough to prove an 
assertion. In our learning setting, we search for a class invariant that correctly 
classifies all reachable objects as valid and all unreachable objects as invalid. 
Depending on the verification task, the class invariant required for this may be 
weaker than the invariant we aim to learn. Accordingly, the manually specified 
ground-truth invariants for evaluating each benchmark item must be as strong as 
possible. Thus, the number of benchmark items is primarily limited by the cost 
of manually specifying these strong class invariants. Evaluating our approach on 
further data structures, including Maps and Sets, is left for future work. 

To evaluate our approach, we have instantiated a random walk and bounded- 
exhaustive generator for each benchmark item and have written property-based 
tests using the provided documentation. We configure the assertion grammar 
to include binary operators among integers (+, -, ==, !=, >=, >), object iden- 
tity, range null checks in arrays, and the ternary operator (c?b: true) to encode 
implications. Extending the grammar with additional operators, such as mul- 
tiplication or division among integers, is straightforward and may improve the 
expressiveness of the grammar. However, the increase of assertions expressible 
in the grammar may lead to timeouts during assertion synthesis. For our exper- 
iments, we limit assertion generation to a maximum of 75 000 assertions. 


4.2 Evaluation Results 


Our results in Table 3 show the number of valid (val.) and invalid (inv.) objects
produced by the bounded-exhaustive generator for our ground-truth invariant, 
which contains A assertions. Because random walks (RW) and property-based 
tests (PBT) are sound, i.e., all objects classified as invalid are guaranteed to be 
invalid, we only report false-negatives (FN), i.e., the number of invalid objects 
that remain undetected. As a behavioral oracle, our random walks have a walk 
length and a walk count of 50. Increasing the walk length and count may improve 
detection accuracy, but at the cost of increased computation time. 

Our evaluation results in Table 4 report on the accuracy of the class invariant
learned by Geminus using random walks or property-based tests as oracle, the 
class invariant detected by Daikon in its default configuration, and the invariant 
validation method documented in the source code (Doc). Geminus and Daikon 
receive the same set of valid objects derived from deterministic random walks 
with a walk length and a walk count of 500 each. Analogously
to using random walks as oracles, increasing the walk length and count may
further improve the object state space coverage in terms of valid objects, but
at the cost of increased computation time. In addition, Geminus derives invalid
objects from the bounded-exhaustive object state generator using its respective
oracle. We only report false-positives (FP) for Daikon, because the invariants
learned by Geminus classify all valid objects as valid in our experiments. We
report the computation time (t) in seconds. All experiments were conducted on
an Apple MacBook Air M2 with 16 GB RAM.

Table 3: Accuracy comparison in detecting invalid objects using manually
written ground-truth class invariants, random walks, and property-based tests;
best results are highlighted in bold.

                Ground-truth             RW       PBT
Item            val.   inv.     A        FN       FN
SimpleSquare    10     431      2        90       0
StackAr         4097   4095     3        0        0
QueueAr         322    10678    13       3078     152
PriorityQueue   1918   154954   8        63149    36708
BitSet          2047   40961    6        19099    18434
ArrayList       4083   38925    4        16398    16398
Vector          4083   38925    4        16398    16398
LinkedList      4      38335    4        4        0
ArrayDeque      385    345727   12       169593   0

Regarding threats to validity, we manually examined the source code of the 
benchmark items to define the ground-truth class invariant. To mitigate the risk 
of specifying an overly restrictive invariant, we validated it against the objects 
visited by our random walk. To address threats to internal validity that may 
arise from random walks, we fixed the random number generator’s seed to ensure 
that the same objects are generated during each walk. Furthermore, we excluded 
probabilistic data structures like skip lists [32] from the benchmark to ensure 
identical internal object states. 


4.3 Oracle Accuracy Comparison 


When used as a behavioral oracle, random walks detect numerous invalid object 
states in our experiments. They exhibit comparable accuracy to property-based 
tests for benchmark items StackAr, ArrayList, and Vector. Additionally, ran- 
dom walks identify a significant portion of invalid objects for LinkedList. The 
majority of unexpected exceptions arise from null dereferencing or accessing out- 
of-bounds indices in arrays. Random walks cannot assess whether the retrieved 
elements from a PriorityQueue are in the correct order. The documentation 
states that retrieving the first element from an ArrayDeque throws an exception
if the structure is empty, but random walks cannot detect cases where the queue
is considered empty, yet a retrieval does not throw an exception.

Table 4: Comparing the accuracy in detecting invalid objects using the class
invariant learned by Geminus, detected by Daikon, and invariant validation
methods documented in the code; best results are highlighted in bold.

                Geminus+RW                    Geminus+PBT                   Daikon                      Doc
Item            FN      t    A    O    Ō      FN      t    A    O    Ō      FN       FP     t    A      FN      A
SimpleSquare    0       3    2    2    2      0       4    2    1    2      1        0      7    2      -       -
StackAr         0       6    2    3    4      0       5    2    3    4      0        0      7    4      -       -
QueueAr         2229    7    4    13   12     542     93   15   39   50     2513     0      9    9      -       -
PriorityQueue   62545   31   2    4    5      9277    298  3    11   11     112462   0      32   6      -       -
BitSet          18434   11   2    3    4      18434   9    2    3    4      55       2036   45   3      0       3
ArrayList       16398   10   2    3    4      16398   10   2    3    4      16398    0      53   3      7181    2
Vector          16398   21   2    3    4      16398   21   2    3    4      16398    0      49   4      7181    2
LinkedList      0       15   10   4    29     0       15   10   4    29     0        4      26   16     729     1
LinkedList*     0       10   2    3    2      0       10   2    3    2
ArrayDeque      98966   74   5    6    9      0       60   8    23   24     169593   0      23   7      30079   7

The property-based tests fail to identify some invalid objects for five items. 
BitSet, ArrayList, and Vector implementations nullify unused array elements 
to aid garbage collection, which does not affect functional behavior. However, 
our tests, which focus on functional behavior, cannot detect objects violating 
this property. Random walks can also only uncover faults related to functional 
behavior. In the case of StackAr, where the ground-truth class invariant is lim- 
ited to functional aspects only, both our tests and the random walks detect all 
invalid objects. For PriorityQueue, polling the first element involves a sift-down 
operation, partially repairing an invalid object state. In contrast, a QueueAr with 
a capacity of zero is considered both empty and full simultaneously, leading any 
method to return immediately, and concealing the remaining state. This is a 
known debugging scenario [88], where a bug can lead to an invalid object state 
without necessarily causing an observable error. 

Regarding RQ1, our benchmark in Table 3 leads to the conclusion that
property-based tests outperform random walks in terms of accuracy. Further- 
more, we observed that the remaining undetected invalid objects either do not 
affect functional behavior or are partially repaired during method invocation, 
rendering their detection challenging. 


4.4 Disparity between Learned Invariants and Leveraged Oracles 


Using random walks as behavioral oracles, Geminus learns invariants that often
surpass the accuracy of the oracles in our experiments. Although our random walks do
not detect all invalid objects for class SimpleSquare (see Table 2), Geminus still
manages to learn the correct class invariant. The accuracy of the learned class
invariant depends on the assertion grammar and the order in which candidate 
assertions are generated. For SimpleSquare, assertions w = h and w > 0 are 
generated before assertions w > 1 and h > 1, which would also resolve all 
misclassified objects found by the random walk oracle. 

Using property-based tests as the oracle, Geminus learns an approximate 
class invariant for class PriorityQueue and ArrayDeque. The current asser- 
tion grammar is not sufficiently expressive to generate a parametrized assertion 
such as queue[(i-1)/2].compareTo(queue[i]) <= 0, which is required for item
PriorityQueue. Nevertheless, the learned invariant is more accurate than the 
underlying oracle. In contrast, Geminus learns a less accurate class invariant for 
QueueAr. While the assertion grammar is expressive enough to generate a suit- 
able assertion with multiple conditions that resolves the indistinguishability, the 
current assertion limit is insufficient in this case. 
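
To illustrate why the heap-order assertion is parametrized, the following sketch
checks it for every index i of an array-based binary min-heap. The method name
heapOrdered and the generic signature are hypothetical; the sketch only assumes
the usual layout in which the parent of index i is stored at index (i-1)/2.

```java
class HeapOrderSketch {
    /**
     * Checks queue[(i-1)/2].compareTo(queue[i]) <= 0 for every index 1 <= i < size.
     * Entries at positions >= size are ignored and may be null.
     */
    static <T extends Comparable<T>> boolean heapOrdered(T[] queue, int size) {
        for (int i = 1; i < size; i++) {
            if (queue[(i - 1) / 2].compareTo(queue[i]) > 0) {
                return false;   // a parent is larger than one of its children
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(heapOrdered(new Integer[] {1, 3, 2, 7, null}, 4));
        System.out.println(heapOrdered(new Integer[] {3, 1, 2}, 3));
    }
}
```

The loop over i is what a grammar limited to fixed integer terms and binary
operators cannot express, since the assertion must hold for an index that ranges
over the whole array.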

Regarding RQ2, our benchmarks in Tables 3 and 4 demonstrate Geminus’s
ability to learn a class invariant that outperforms the oracle, resulting in a lower 
number of false-negatives. Both cases of approximate invariants are due to the 
inability of the assertion grammar to generate suitable assertions. To gener- 
ate parametrized assertions, the assertion grammar needs to be extended with 
lambda expressions. To better support assertions with multiple conditions, which 
would pave the way for analyzing more complex Java projects, we plan to replace
our conjunctive assertion model with a conjunctive normal form model for
model training (cf. Section 6).


4.5 Comparing Geminus, Daikon, and Invariant Validation Methods 


Daikon [8] generates assertions using templates and retains only those assertions
that hold for valid objects. It performs as well as Geminus for simple data
structures like StackAr, but it generates less accurate class invariants for
other benchmark items. For SimpleSquare, it identifies the incorrect invariant
w = h ∧ w ≥ 0, which fails to detect (0, 0). While [20] excludes unqualified
calls, Daikon considers them, which may result in learning an overly permissive invariant. In
contrast, Geminus considers qualified calls only and learns the correct invariant. 

The invariants learned by Geminus may produce false-positives, but never 
did so in our experiments. The invariants documented in the state validation 
methods also produce no false-positives, as anticipated. However, Daikon does 
report false-positives for BitSet and LinkedList. For BitSet, this is due to 
the random walk configuration inadequately representing the object state space, 
which leads Daikon to retain the overly restrictive assertion words[] elements 
>= 0, encoding that all array elements are greater than or equal to zero. Because 
Geminus solely adds assertions to detect previously undetected invalid objects, 
it learns the correct invariant in this example. While this mechanism proves 
advantageous when dealing with unrepresentative valid objects, Geminus relies 
on a representative set of invalid objects.

The LinkedList class uses a doubly-linked list structure with prev and next
attributes. Daikon detects assertions aiding program comprehension, but it lacks 
the necessary guards to avoid false-positives. While Daikon only considers valid 
objects and thus does not require an additional oracle to detect invalid ob- 
jects, it may learn overly permissive invariants. For example, Daikon identi- 
fies the doubly-linked style through the first == first.next.prev assertion. 
However, it overlooks the need for a guard to prevent null dereferencing. Iden- 
tifying necessary assertions containing guards is a challenging task when only 
valid objects are available. Considering invalid objects assists Geminus in finding 
the necessary assertions, like first != last ? first == first.next.prev : 
true. Despite the recursive structure of LinkedList, Geminus learns an invariant that
accurately detects all invalid objects. This is possible because the bounded-exhaustive
object state generator only covers LinkedList object states with up to
three list nodes. Note that linkage-based classes exhibit large object state spaces
even for a small number of linked elements, which is due to reference aliasing. 
While the documented validation method accurately characterizes the case of an 
empty list, it imposes an overly permissive constraint for non-empty lists, namely 
first.prev == null && last.next == null. The crucial constraint that the 
previous attribute of the next node is the current node is not documented. 
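The role of the guard can be illustrated with a minimal Python sketch. The Node and LinkedList classes below are hypothetical stand-ins that model only the first, last, prev, and next attributes mentioned above; they are not the java.util implementation, and the two functions merely transcribe the two assertions from the text.

```python
class Node:
    """Minimal doubly-linked node (hypothetical stand-in, not java.util)."""

    def __init__(self, item):
        self.item = item
        self.prev = None
        self.next = None


class LinkedList:
    """Models only the attributes used by the assertions discussed above."""

    def __init__(self):
        self.first = None
        self.last = None

    def add_last(self, item):
        node = Node(item)
        if self.last is None:
            self.first = self.last = node
        else:
            node.prev = self.last
            self.last.next = node
            self.last = node


def unguarded(lst):
    # Unguarded assertion: first == first.next.prev.
    # Dereferences None for empty and single-element lists.
    return lst.first is lst.first.next.prev


def guarded(lst):
    # Guarded assertion: first != last ? first == first.next.prev : true
    if lst.first is not lst.last:
        return lst.first is lst.first.next.prev
    return True
```

On an empty list, unguarded raises an AttributeError (the Python counterpart of a null dereference), whereas guarded evaluates to True. On a two-element list whose second node has a corrupted prev reference, guarded evaluates to False.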


The linearization [7] technique maps a linkage-based structure to an array 
representation. We can enrich our grammar with the closure abstraction to store 
the objects that are reachable from a given object, using a specific attribute 
in an array. While the linearization in [7] is used to reason about the values
stored in a list, this closure abstraction allows one to characterize the double 
linkage structure by expressing that the closure from the first element via the 
next attribute is reverse to the closure from the last element via the prev 
attribute. In LinkedList*, Geminus uses this grammar to learn an invariant 
that generalizes to lists of arbitrary length. 
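A minimal Python sketch of the closure abstraction follows. The names closure and doubly_linked as well as the Node class are hypothetical, and the cycle check is an added assumption so that the traversal also terminates on cyclic (invalid) object states; the grammar used by Geminus may express this differently.

```python
class Node:
    """Hypothetical list node with prev and next attributes."""

    def __init__(self):
        self.prev = None
        self.next = None


def closure(start, attribute):
    """Collect the objects reachable from start by following one attribute."""
    reached, seen = [], set()
    node = start
    # The seen set is an assumption: it stops the walk on cyclic states.
    while node is not None and id(node) not in seen:
        seen.add(id(node))
        reached.append(node)
        node = getattr(node, attribute)
    return reached


def doubly_linked(first, last):
    """closure(first, next) must be the reverse of closure(last, prev)."""
    forward = closure(first, "next")
    backward = closure(last, "prev")
    return len(forward) == len(backward) and all(
        a is b for a, b in zip(forward, reversed(backward))
    )
```

Because the check compares two whole closures, it does not depend on the number of nodes, which is the property that lets such an invariant generalize beyond the list lengths seen during learning.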


The invariant validation methods for BitSet, ArrayList, and Vector require 
null elements at the next free array location, while our ground-truth checks all remaining
locations. Neither constraint affects the functional behavior, and thus neither is
detectable by our oracles. In practice, invariants ensuring a functionally
equivalent behavior typically suffice. Similarly, ArrayDeque requires elements in 
the queue to be different from null. It concludes from a null value when fetching 
the first/last element that the queue is empty. The documentation mentions that 
all non-live elements in the array are null, but this is only partially checked in 
their checkInvariants method, leading to numerous undetected invalid objects. 


Regarding RQ3, our benchmark demonstrates that Geminus
learns more accurate invariants when using the more accurate property-based 
tests as oracle, instead of the random walk oracle. Moreover, it often outperforms 
Daikon in terms of accuracy. Unlike Daikon, our tool identifies necessary guards 
for complex object states most of the time, avoiding overly permissive or incorrect 
invariants. Notably, Geminus achieves greater accuracy than the documented 
validation methods, especially for the complex object states of LinkedList or 
ArrayDeque. 


5 Related Work


This section contrasts our dynamic class invariant learning to related dynamic 
assertion learning approaches. 

Daikon [8] exhaustively instantiates its assertion templates and retains only 
those assertions that hold for all observed states at desired program locations. 
In contrast, Geminus uses the first assertion that suffices to detect a so far 
misclassified invalid object. Because Daikon considers valid objects only, it relies 
on static analysis to prune overly permissive, equivalent, or redundant assertions. 
In contrast, Geminus employs invalid objects to exclude such assertions, which 
allows us to consider a much larger set of candidate assertions. 
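The difference between the two retention strategies can be sketched in Python as follows. This is a strongly simplified illustration under stated assumptions: candidate assertions are plain predicates supplied as a fixed list, whereas Geminus enumerates them from a grammar, and the function names daikon_style and geminus_style are hypothetical.

```python
def daikon_style(candidates, valid_objects):
    """Retain every candidate that holds on all observed valid objects."""
    return [a for a in candidates if all(a(o) for o in valid_objects)]


def geminus_style(candidates, valid_objects, invalid_objects):
    """Add an assertion only if an invalid object is still misclassified."""
    invariant = []
    for obj in invalid_objects:
        if not all(a(obj) for a in invariant):
            continue  # obj is already detected as invalid
        for a in candidates:
            # Use the first assertion that rejects obj and holds on valid objects.
            if not a(obj) and all(a(v) for v in valid_objects):
                invariant.append(a)
                break
    return invariant


# Hypothetical object states of a bounded container.
valid = [{"size": 0, "cap": 4}, {"size": 2, "cap": 4}]
invalid = [{"size": -1, "cap": 4}, {"size": 5, "cap": 4}]


def cap_is_4(o):
    return o["cap"] == 4


def size_nonneg(o):
    return o["size"] >= 0


def size_le_cap(o):
    return o["size"] <= o["cap"]


def size_even(o):
    return o["size"] % 2 == 0


candidates = [cap_is_4, size_nonneg, size_le_cap, size_even]
```

In this toy setting, daikon_style retains all four candidates, including the accidental ones (cap == 4 and an even size) that merely reflect the small sample of valid objects. geminus_style retains only size_nonneg and size_le_cap, since each of them is needed to detect one of the invalid objects.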

PIE [27] learns preconditions and loop invariants from (in)valid objects and 
uses a feature grammar to construct assertions in conjunctive normal form on- 
the-fly; however, Valiant’s algorithm [36] limits PIE to small formulas. While PIE 
requires a postcondition to correctly label the set of predefined program states 
during learning, Geminus uses behavioral oracles to detect invalid objects. 

Alearner [30] derives preconditions and uses a test suite to detect invalid 
method inputs. While Geminus keeps the object graph of each (in)valid example, 
Alearner only stores an abstraction, which limits precondition expressiveness and 
hinders manual inspection of training data. Alearner uses program mutation to 
obtain potentially invalid object states, but does not validate this assumption. 

OASIs [15] assesses soundness and completeness of an assertion located
within the program. Similar to our random walks, OASIs generates execution 
scenarios to identify overly restrictive assertions. It uses mutation testing to 
deem an assertion overly permissive; however, this technique cannot be applied 
to class invariants, because they cannot be mapped to a single program location. 
GAssert [35] uses OASIs to evaluate the quality of an assertion and enhance it 
for soundness, completeness, and assertion size using an evolutionary learning al- 
gorithm. Its evolutionary technique can be an alternative to our grammar-based 
assertion enumeration, but necessitates defining evolutionary operators. 

Like Geminus, Proviso [2] addresses complex object states, but it learns
preconditions from observer methods. In contrast, Geminus learns class invari- 
ants from private attributes. While Proviso uses a test generator to obtain 
(in)valid argument values, invalid object states cannot be derived in this way. If 
no distinguishable feature can be constructed, Proviso relabels valid objects as 
invalid. Geminus’ objects are guaranteed to be (in)valid. 

Hanoi [22] and Geminus both learn invariants from (in)valid objects. While 
Hanoi’s notion of constructible value bears similarity with random walks, their 
invalid objects are not proven invalid and must be recomputed after finding a 
new so far misclassified valid object. Hanoi learns representation invariants for 
types in a functional language and constructs a single definition that captures 
the recursive structure of the type. In contrast, Geminus iteratively refines a set 
of assertions to learn the invariant of a class in an object-oriented language. 

EvoSpex [25] employs an evolutionary algorithm, but learns postconditions 
from (in)valid pre/post state pairs. Invalid pairs are obtained via state mutation, 
which, however, does not necessarily yield invalid states. Geminus solves this
problem for class invariants using behavioral oracles, and thereby considers only
states proven invalid. While Geminus utilizes Java expressions, EvoSpex encodes
assertions in the Alloy language [14]. The assertion enumeration component in 
Geminus is language agnostic and can be replaced with, e.g., Alloy. 

SpecFuzzer tackles the problem that inferred specifications often contain 
equivalent assertions. It uses Daikon to remove overly restrictive assertions and 
then applies program mutation to derive possibly invalid states in order to con- 
struct equivalence partitions among the remaining assertions. Geminus prevents 
the generation of equivalent assertions, similar to SpecFuzzer, via observational 
equivalence reduction [1,28]. While equivalence partitions can be constructed
without knowing whether a state is valid or invalid, states that are guaranteed to be
invalid allow us to assess whether an invariant is sufficient. Geminus generates
new assertions until a suitable assertion that detects an invalid state is found. 


6 Conclusions 


To ensure that modifications to legacy software conform to existing assumptions, 
it is essential to make implicit guarantees explicit, e.g., in the form of method pre- 
conditions and class invariants. However, class invariants encoding object state 
assumptions are rarely documented and almost never checked automatically. 

In this paper, we presented a dynamic analysis for class invariant learning 
that automatically derives (in)valid objects and distinguishes between them by 
grammar-derived assertions. We leverage random walks in object state spaces
to find valid objects and a combination of complex test input generators from 
bounded-exhaustive testing with behavioral oracles to find invalid objects. In 
this setting, random walks can even be reused as behavioral oracles. Our pro- 
totype tool Geminus improves upon related tools such as Daikon by learning 
invariants for complex classes, such as dynamic data structures included in the 
java.util package, resulting in a higher accuracy in detecting invalid objects. 
Considering invalid objects, too, allows Geminus to prevent the generation of 
equivalent assertions, thereby leading to concise invariants without the need for 
static assertion equivalence checks. 

The capabilities of dynamic class invariant learning approaches primarily rely 
on finding so far misclassified (in)valid objects and training a suitable invariant 
model. While finding execution paths that result in a representative set of valid 
objects is well understood in the context of software testing, finding representative
invalid objects is less studied and should be a focus of future work.
Sampling object states while executing a mutated program is likely a source of
potentially invalid objects worth exploring. Our conjunctive assertion model
struggles to scale with respect to invariants containing multiple guards per as- 
sertion. Future work should focus on crafting heuristics for learning formulas in 
conjunctive normal form to model complex class invariants with multiple guards. 


Data-Availability Statement. The source code of Geminus, the benchmark
items, the evaluation results, and instructions for reproduction are available online
via DOI 10.5281/zenodo.10514765.

Abstract. Given the scale and complexity of large online service sys-
tems and the diversity of environments in which the services are to be 
invoked, it is inevitable that those service systems contain bugs that 
affect the users. As a result, it is essential for service providers to dis- 
cover issues in their systems based on information gathered from users. 
iFeedback is a state-of-the-art technique for user-feedback-based issue 
detection. While it has been deployed to help detect issues in real-world 
service systems, the accuracy of iFeedback’s detection results is relatively 
low due to limitations in its design. In this paper, we propose the SKYNET 
technique and tool that analyzes both user feedback gathered via spe- 
cific channels and public posts collected from social media platforms to 
more accurately detect issues in service systems. We have applied the 
tool to detect issues for three real-world, large-scale online service sys- 
tems based on their historical data gathered over a ten-month period of 
time. SKYNET reported in total 2790 issues, among which 93.0% were 
confirmed by developers as reflecting real problems that deserve their 
close attention. It also detected 58 out of the 62 severe issues reported 
during the period, achieving a recall of 93.5% for severe issues. Such 
results suggest SKYNET is both effective and accurate in issue detection. 


1 Introduction 


Large-scale online service systems are becoming indispensable for people’s work 
and everyday life nowadays. They also get more and more complex so as to 
support the ever-growing needs of their users for new and more powerful func- 
tionalities. The scale and complexity of such services as well as the diversity of 
environments in which the services are to be invoked, however, have made it 
more challenging than ever for developers to make sure the services will always 
behave as expected. Despite the tremendous amount of time and effort devel- 
opers invest in testing and debugging such online service systems, it is almost 
inevitable that some bugs escape the developers’ attention, get released into the 
field, and negatively impact users’ experience with the services. It is, therefore, 
extremely important for the service providers to discover issues in their systems 
based on information gathered from users in a timely manner. 

In view of that, Zheng et al. [45] recently proposed the iFeedback approach to
detecting issues based on user feedback. While the approach has been deployed 
to help detect issues in large-scale online service systems and has successfully 
detected severe issues, the overall precision of its results is relatively low, 76.2% 
to be exact [45]. We conjecture there are three reasons for that. First, iFeedback 
extracts word combinations from feedback texts as indicators of issues. Since 
word combinations only capture the lexical, rather than semantic, characteristics
of feedback texts, they, as issue indicators, tend to be overly sensitive to
the wording of user feedback. Second, iFeedback detects anomalies at the level 
of time intervals based on all the user feedback gathered during those intervals, 
which is too coarse-grained. Since a wide range of different types of user feed- 
back, concerning issues or not, may get reported during each time interval, it 
is more likely for iFeedback’s judgment to be influenced or even misled by user 
feedback that does not report any issues. Third, iFeedback applies an unsuper- 
vised algorithm to cluster the feedback during anomalous time intervals based 
on the word combinations and their contexts. While unsupervised clustering al- 
gorithms are less expensive to apply, they tend to produce less precise results 
than supervised algorithms in general [36]. 


To address these limitations of iFeedback and improve the quality of issue 
detection results, we propose in this paper a novel approach, named SKYNET, to 
automatically detecting issues in online service systems based on multi-channel 
user input, including both user feedback and messages posted on social media 
platforms. More concretely, SKYNET first employs a cascading classifier to label 
the user feedback texts based on an input hierarchical label system for different 
types of user experiences. Then, it applies time-series data analysis to predict, 
based on historical data, a threshold for the normal frequencies of user feedback 
reporting each known type of negative user experience; and it reports an issue 
when more feedback of the same type than allowed by the threshold is gathered 
from the users. Meanwhile, for user feedback reporting negative experiences of 
previously unknown types, SKYNET reports an issue when an abnormally large amount
of such user feedback concerns similar negative user experiences. The semantic
embedding of feedback texts and the customized issue detection process adopted
by SKYNET enable it to detect more real issues in service systems and to prune
out most false positives. Given that social media platforms have become important
and popular venues for users to share their experiences with various
services and products, SKYNET also monitors and analyzes messages posted on
social media platforms to detect issues before they generate a large amount of
user feedback or attract considerable unwanted public attention.
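The threshold-based detection for known types of negative user experience can be sketched in Python as follows. This passage does not state how the threshold is predicted from historical data, so the mean-plus-k-standard-deviations rule below is purely an assumption used as a stand-in; the function names and the parameter k are hypothetical as well.

```python
from statistics import mean, stdev


def predict_threshold(history, k=3.0):
    """Assumed stand-in rule: mean plus k standard deviations of past counts."""
    return mean(history) + k * stdev(history)


def detect_issue(history, current_count, k=3.0):
    """Report an issue when more feedback arrives than the threshold allows."""
    return current_count > predict_threshold(history, k)


# Hypothetical counts of feedback with one label in past time windows.
past_counts = [10, 12, 9, 11, 10, 13, 11]
```

With these counts, a window containing 30 pieces of feedback with that label exceeds the predicted threshold and is reported, whereas a window containing 12 is not.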


We have implemented the SKYNET approach in a tool with the same name.
To empirically evaluate SKYNET’s effectiveness, we applied it to detect issues for
three real-world, large-scale online service systems based on their historical data
gathered over a ten-month period. SKYNET reported in total 2790 issues,
93.0% of which were confirmed by operators and developers as reflecting real 
problems that deserve their close attention. Besides, SKYNET was able to detect 
58 of the 62 severe issues that occurred during that period of time. Such results
suggest SKYNET is highly effective and accurate in issue detection. 
Contributions. This paper makes the following contributions: 


— We propose the SKYNET technique that analyzes both user feedback gath- 
ered from specific channels and public posts collected from social media 
platforms to accurately detect issues in large-scale online service systems. 

— We develop SKYNET into a tool with the same name. 

— We empirically evaluate SKYNET by applying it to detect issues for three 
real-world service systems based on historical data. The results produced 
suggest that SKYNET is highly effective and accurate. 


2 Related Work 


Our work is closely related to existing work in the following areas. 

Anomaly detection based on backend monitoring. In view that many 
issues in online service systems affect performance attributes like “disk queue 
length” and “network retransmission rate” of the backend systems, people often 
monitor the corresponding key performance indicators (KPIs) of the systems and 
rely on the values to detect anomalies in those services.
For instance, Laptev et al. [21] proposed the EGADS system that combines
a collection of anomaly detection and forecasting models to detect anomalies 
in time-series KPI data. Liu et al. proposed the Opprentice system that 
trains a random forest with labeled KPI features to select appropriate param- 
eters and thresholds for existing detectors. Xu et al. [44] proposed an unsuper- 
vised anomaly detection algorithm, named Donut, to effectively detect anomalies 
in seasonal KPIs. Given that online service systems automatically generate is- 
sue reports and alerts when the monitored indicators exhibit anomalous values, 
techniques have also been developed to mine attribute collections of issue reports
[15,24] to characterize and detect incidents [22].

Issue detection based on user feedback. Many issues, e.g., user interface 
defects and silent back-end issues, in those systems, however, are not reflected by 
pre-defined KPIs [45]. In view of that and the fact that user opinions coming in 
different forms (e.g., user feedback, tweets, and forum posts) contain valuable information
to support software development and maintenance [12,13,29,30,41,42],
Zheng et al. [45] proposed the iFeedback approach to detecting issues based on 
user feedback on-the-fly. iFeedback first extracts word-combination-based indicators
to represent an issue and collects each indicator’s historical occurrence
trend (HOT). Then, the long-term and short-term windows of the HOTs are fed
to a binary classifier to identify anomalous time intervals. Finally, user
feedback from time intervals containing issues is clustered as reporting different
issues. SKYNET improves on iFeedback from three perspectives. First, iFeedback
extracts word combinations from feedback texts as indicators of issues, which 
captures only the lexical characteristics of feedback texts, while SKYNET em- 
ploys the ALBERT-tiny model to encode user feedback so that the semantics 
of user feedback can be taken into account during the issue detection process. 

Fig. 1: An overview of the issue detection process with SKYNET.


Second, iFeedback detects anomalies at the level of time intervals based on all 
the gathered user feedback, which is often too coarse-grained and increases the 
chance of coincident non-issue-reporting feedback influencing and misleading the 
issue detection process. In contrast, SKYNET employs a cascading classification 
algorithm to label user feedback based on a hierarchical label system and only 
takes feedback that reports negative user experiences into account in the re- 
maining issue detection process. Third, SKYNET also monitors and analyzes 
messages posted on social media platforms to detect issues in a timely manner, 
which complements user-feedback-based issue detection. 

Learning from user opinions in other forms. User opinions in other 
forms have also been utilized to support various types of activities in software 
development. Gao et al. [14] proposed the IDEA framework that detects issues
from review texts of apps. Stanik et al. [38] proposed an approach to identify
aspects of software systems to improve based on user comments received
on Twitter. While those identified aspects may indeed need improvement, they
are not necessarily issues in the corresponding software systems. Guzman et
al. [16] proposed the ALERTme approach that automatically classifies, groups,
and ranks tweets to facilitate the analysis of application-related tweets. Williams 
and Mahmoud [43] conducted a study on leveraging Twitter as a main source 
of software user requirements. Johann et al. proposed the SAFE approach 
that extracts keywords from app feature descriptions written by developers and 
app reviews on app stores to better characterize the apps. Compared with these 
works, SKYNET focuses on detecting issues in online service systems based on 
user feedback and social media posts. 


3 The SkyNet Approach 


Figure 1 depicts an overview of the issue detection process with SKYNET. SKYNET
leverages deep learning algorithms to detect issues based on multi-channel data 
and it combines two loosely coupled processes: The main process is designed 
for detecting issues based on user feedback texts gathered through dedicated 
channels that are embedded in the service systems, while the auxiliary process 
complements the main process and aims to detect issues using posts collected
from social media platforms. Each issue detected by SKYNET is associated with
a collection of user feedback, a social media post in case it is the main concern 
of the post, and a list of ten keywords extracted from the user feedback and 
post using the TF-IDF method [6]. While the keywords help provide a rough 
idea about an issue, developers must examine the associated user input to de- 
termine whether the reported issues reflect real problems in the service systems. 
In the rest of this section, we explain in detail the steps in SKYNET’s main and
auxiliary issue detection processes. 
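The keyword extraction can be illustrated with a small, dependency-free Python sketch of TF-IDF. The tokenizer, the choice of a background corpus, and the function name top_keywords are assumptions made for illustration; the text only states that ten keywords are extracted with the TF-IDF method [6].

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (an assumption; any tokenizer could be used)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def top_keywords(issue_texts, background_texts, k=10):
    """Rank the terms of one issue's texts by TF-IDF against other texts."""
    issue_tokens = [t for text in issue_texts for t in tokenize(text)]
    tf = Counter(issue_tokens)
    # Each background text is one document; the issue's texts form another.
    documents = [set(tokenize(text)) for text in background_texts]
    documents.append(set(issue_tokens))
    scores = {}
    for term, count in tf.items():
        df = sum(1 for doc in documents if term in doc)  # df >= 1 by construction
        idf = math.log(len(documents) / df)
        scores[term] = (count / len(issue_tokens)) * idf
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical feedback texts.
issue = ["cannot export video", "export fails every time", "video export stuck"]
background = ["cannot login to my account", "video plays fine", "payment fails"]
```

For the hypothetical texts above, the highest-ranked keyword is "export", since it is frequent within the issue and absent from the background texts.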

Note that, as in other model-based approaches, we periodically review the in- 
put user feedback and social media posts as well as the detected issues, manually 
rectify the incorrect detection results if any, and use the new data to fine-tune the 
models that SKYNET utilizes so as to keep the models fit for the updated business 
situation and to prevent model degradation. Also note that, although sometimes 
users include images in their feedback and social media posts to help explain the 
problems they have encountered, SKYNET does not utilize such information in 
its current implementation. We leave the development of new techniques that 
exploit the extra image information to facilitate issue detection for future work. 


3.1 Hierarchical Classification of User Feedback 


The first step in issue detection with SKYNET is to decide the type of user ex- 
perience that each piece of the gathered user feedback reports. SKYNET makes 
such decisions on the basis of a hierarchical label system, where the labels char- 
acterize with different levels of detail the types of (negative) user experiences 
that users report in their feedback. 

SKYNET differentiates three broad categories of user feedback in issue de- 
tection, namely feedback reporting negative user experiences of a known type, 
feedback reporting negative user experiences of unknown types, and feedback 
not reporting negative user experiences. User feedback from the first two cate- 
gories is collectively called negative experience reporting feedback. Note that not 
all negative user experiences are caused by issues in service systems. For exam- 
ple, although a user’s access to an online service will be blocked if her device 
is offline due to a hardware failure, the experience does not indicate anything 
problematic in the online service system. 


Feedback Encoding. Since SKYNET is designed to detect issues in large-scale
online service systems, and it may need to process a large amount of user feedback
under tight time constraints, we use ALBERT-Tiny to encode the
user feedback. BERT is a pre-trained state-of-the-art language representation
neural network model with strong semantic comprehension capability. ALBERT [20]
is a lite BERT architecture that lowers the memory consumption
and increases the training speed of BERT, without significantly sacrificing
BERT’s semantic comprehension ability, by sharing parameters across layers
and reducing embedding dimensions of words. ALBERT-Tiny is the smallest
version of ALBERT and is 10x faster than BERT for inference.

Fig. 2: A sample hierarchical label system (in blue) and some examples of the associated
user feedback. 


Hierarchical Label System. Correctly deciding which type of user experience
each piece of user feedback reports is crucial, since incorrect decisions made here may
mislead the downstream steps and cause the whole task of issue detection to
fail. SKYNET employs an existing hierarchical label system to facilitate making 
those decisions. In the system, each label corresponds to a particular type of 
user experience that users may have with the target online service system. 
Designing a label system to properly characterize user experiences is a chal- 
lenging task. SKYNET adopts a hierarchical, rather than flat, label system mainly 
because it is extremely difficult, if not impractical, to decide a priori on the right 
granularity level for the labels in a flat system so as to strike a good balance 
between the accuracy and the value of the classification results based on that 
label system. On the one hand, a coarse-grained label system often makes it 
easier for a classifier to correctly label the input data, but the classification re- 
sults may not be very useful since each label encodes little extra information. 
On the other hand, a fine-grained label system typically makes it harder for a 
classifier to correctly label the input data, but a correct label in this case can be 
highly valuable since it encodes abundant extra information. In the context of 
user feedback classification for issue detection, coarse-grained labels provide rel- 
atively vague information about the user experience, which may not be sufficient 
to help developers effectively confirm or understand the underlying issues. 
Figure 2 displays part of the hierarchical label system that SKYNET uses for
classifying the user feedback on an online video editing system. In the hierarchical 
label system, labels at the top level classify all the user feedback into broad 
categories concerning aspects like “Functionality” and “User Account” of the 
online system, labels at the intermediate level partition the broad categories 
into smaller, finer-grained ones, while labels at the bottom level correspond to 
specific types of experiences that users may have when using the online system. 
Two top-level labels in the hierarchical label system, namely “Unknown” and 
“Non-negative”, are special in the sense that they do not have subordinate labels 
because they are for user feedback texts that report negative user experiences 
of previously unknown types and that do not report negative user experiences, 
respectively. Since some user experiences of previously unknown types may still
reveal important issues of the systems, SKYNET conducts extra analysis on the 
related feedback to determine whether it reports any issues. A later section gives more
details about the analysis. User feedback classified as “Non-negative” will not
be further processed by SKYNET. 
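A three-level label system of this shape can be sketched in Python as a nested mapping. Only the top-level labels “Functionality”, “User Account”, “Unknown”, and “Non-negative” come from the text; the intermediate and bottom-level labels below are hypothetical placeholders, and the route function is merely one possible way to express how the three categories of feedback are handled.

```python
# Top-level labels are from the text; the lower levels are hypothetical.
LABEL_SYSTEM = {
    "Functionality": {"Export": ["Export fails", "Export is slow"]},
    "User Account": {"Login": ["Cannot log in"]},
    "Unknown": {},        # negative experiences of previously unknown types
    "Non-negative": {},   # feedback not reporting negative experiences
}


def route(path):
    """Decide how feedback labeled with the given label path is processed."""
    top = path[0]
    if top not in LABEL_SYSTEM:
        raise ValueError(f"unknown top-level label: {top}")
    if top == "Non-negative":
        return "discard"
    if top == "Unknown":
        return "unknown-type analysis"
    _, middle, bottom = path  # known types carry one label per level
    if bottom not in LABEL_SYSTEM[top].get(middle, []):
        raise ValueError(f"label path not in the hierarchy: {path}")
    return "known-type analysis"
```

For example, route(("Functionality", "Export", "Export fails")) selects the analysis for known types, while route(("Non-negative",)) indicates that the feedback is not processed further.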

Figure 2 also lists some example feedback snippets from users of the online
video editing system and associates the snippets to their corresponding labels. 
Two things from the examples are worth noting. First, users often use different 
words in describing the same issue. For example, the words “save” and “export”
were used in snippets 1-1 and 1-2, respectively, to refer to the action of exporting a
video. Second, the same or similar words may be used
to describe user experiences of distinct types. For example, the word “save” was
used in both snippets 2-2 and 3-2, which report different types of negative user 
experiences. Due to such flexibility in natural language expressions, using word 
combinations like (“save” and “video”) to characterize and group user feedback, 
as was done in previous work [45], may often produce results of low precision. In 
view of that, SKYNET extracts the semantics of the experiences reported in user 
feedback via deep learning and classifies user feedback based on their semantics. 

We do not consider the requirement for an input hierarchy of user feedback 
labels as a major restriction to SKYNET’s applicability for two reasons. First, 
although not every service system readily has a dedicated hierarchy of user feed- 
back labels, hierarchies from similar systems could be used instead to bootstrap 
the application of SKYNET on a new service system since, according to our 
experience, systems with similar functionalities often share hierarchies of user 
feedback labels. Second, a collection of appropriate issue labels is essential for the 
effective management of issues in large online service systems. Developers need 
to devise the labels with or without tool support, and the labels can be organized 
into a hierarchy to drive SKYNET. While the construction of such a hierarchical 
label system may require some manual effort, such investment is worthwhile in 
the long term since a high-quality label system can greatly improve the result 
accuracy of feedback classification and issue detection. 


Cascading Classification. SKYNET employs cascading classification to associate
user feedback with the labels from the hierarchical label system. Cascading
is a particular case of ensemble learning based on the concatenation of several
sub-classifiers [2]. In SKYNET’s cascading classification for hierarchical labels, 
each sub-classifier targets only the labels at a particular level, and the output 
of a high-level sub-classifier is used as additional input to drive lower-level sub- 
classifiers in the cascade. In such a setting, it is relatively easier for high-level 
sub-classifiers to produce proper classification results since the number of labels 
they need to consider is small and the differences between instances from different
classes are large. It is also relatively easier for low-level sub-classifiers to
achieve more precise classification results since they only need to focus on the
labels subordinate to those labels output by high-level sub-classifiers [35].

Figure 3 shows the cascade classifier SKYNET employs to categorize the user
feedback on the online video editing system described in Section 3.1. The classifier
contains three sub-classifiers, one for each level of the label hierarchy. Each
sub-classifier is a two-layer network, with the neural cells on each layer being
fully connected with each other, and it takes all its parent-level classifiers’ output,
if any, as input for the current level’s classification. For instance, the top-level
sub-classifier classifies user feedback based on the highest-level labels like “Functionality”
and “User Account” according to the input text embedding, while
the bottom-level sub-classifier takes both the text embedding and the output of
the two sub-classifiers at higher levels as input to conduct the most fine-grained 
classification. The connections between classifiers help preserve the cascade re- 
lationship between multi-level labels and improve classification accuracy. 
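The data flow through such a cascade can be sketched in plain Python. The hidden layer size, the ReLU and softmax activations, the random initialization, and all names below are assumptions made for illustration; only the structure (three two-layer sub-classifiers, each receiving the text embedding plus the outputs of all higher-level sub-classifiers) follows the description above.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


class SubClassifier:
    """Two fully connected layers (sizes and activations are assumptions)."""

    def __init__(self, n_in, n_hidden, n_out, rng):
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                   for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def __call__(self, inputs):
        hidden = relu(dense(inputs, self.w1, self.b1))
        return softmax(dense(hidden, self.w2, self.b2))


def build_cascade(embedding_dim, labels_per_level, hidden=16, seed=0):
    rng = random.Random(seed)
    cascade, n_in = [], embedding_dim
    for n_labels in labels_per_level:
        cascade.append(SubClassifier(n_in, hidden, n_labels, rng))
        n_in += n_labels  # lower levels also receive this level's output
    return cascade


def classify(cascade, embedding):
    """Return one probability vector per level of the label hierarchy."""
    outputs, features = [], list(embedding)
    for sub_classifier in cascade:
        probabilities = sub_classifier(features)
        outputs.append(probabilities)
        features = features + probabilities
    return outputs
```

With an 8-dimensional embedding and 4, 6, and 10 labels on the three levels, the bottom-level sub-classifier receives 8 + 4 + 6 = 18 input values. The weights here are untrained, so the outputs only demonstrate the shapes involved.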

Particularly, each sub-classifier is a multi-class classifier with a loss function
defined as $L = \frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} loss(y_{ic}, \hat{y}_{ic})$,
where $N$ is the number of samples, $C$ is the total number of classes in the
classification, $\hat{y}_{ic}$ is the probability of the $i$-th training example
belonging to the $c$-th class, $y_{ic}$ is a binary indicator function that
represents the ground truth label, while $loss(y_{ic}, \hat{y}_{ic})$ is the cross-entropy loss
between the classification results and the ground truth. Cross-entropy loss [10] 
is a common loss function for classification tasks, and its value increases as the 
predicted probability diverges from the actual labels. 

The loss function for the overall cascading classification model is defined as
$L_{overall} = \alpha L_1 + \beta L_2 + \gamma L_3$. That is, the overall loss
$L_{overall}$ of the model is the weighted sum of the loss $L_n$ at the $n$-th
cascading level ($1 \le n \le 3$), with $\alpha$, $\beta$, and $\gamma$ being the
weights of corresponding levels. We assign decreasing values 0.8, 0.6, and 0.4
to $\alpha$, $\beta$, and $\gamma$, respectively, based on the intuition that an
incorrect label at any level will lead to incorrect labels for all the underneath 
levels. With the cascading connections, the weight of the first level sub-classifier 
will be adjusted with respect to the loss of all classifiers at the three levels 
during back-propagation, and the weight of the second level sub-classifier will be 
adjusted with respect to the loss of sub-classifiers at the second and third levels. 
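
The weighted combination of the per-level losses can be illustrated with a minimal
sketch. It assumes one-hot ground-truth rows and already-normalized class
probabilities; the function names `cross_entropy` and `overall_loss` are
hypothetical and are not taken from SKYNET's implementation.

```python
import math

def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean cross-entropy over N samples.

    y_true: list of one-hot rows (ground truth), y_prob: list of probability rows.
    """
    total = 0.0
    for truth_row, prob_row in zip(y_true, y_prob):
        # Only the true class contributes because truth_row is one-hot.
        total += -sum(t * math.log(max(p, eps)) for t, p in zip(truth_row, prob_row))
    return total / len(y_true)

def overall_loss(level_losses, weights=(0.8, 0.6, 0.4)):
    """Weighted sum of the level losses, with the weights reported in the text."""
    return sum(w * loss for w, loss in zip(weights, level_losses))

# Toy example: one sample per level, each predicted with uniform probabilities.
l1 = cross_entropy([[1, 0]], [[0.5, 0.5]])
l2 = cross_entropy([[0, 1]], [[0.5, 0.5]])
l3 = cross_entropy([[1, 0]], [[0.5, 0.5]])
total = overall_loss([l1, l2, l3])
```

In an actual training loop the three losses would be produced by the three
sub-classifiers, and the gradient of the weighted sum would flow back through
the cascading connections as described above.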


3.2 Issue Detection Based on User Feedback 


While it is useful to classify feedback texts based on the types of user experi- 
ences they report, it is neither necessary nor practical to manually examine all 
the user feedback that reports negative experiences. On the one hand, not all 
user feedback reporting negative experiences is caused by issues in online service
systems that demand manual inspection by developers. On the other hand,
user feedback reporting negative experiences with popular service systems often 
comes in overwhelming numbers, and therefore it can be prohibitively expensive 
to manually handle all of that feedback.

To help developers better distribute their time and effort on tasks for issue 
handling, SKYNET only reports issues for negative experiences shared by a large 
number of users. Particularly, SKYNET employs a time series forecasting technique
to dynamically predict a threshold for the frequency of each known type
of negative user experience. An alert indicating the discovery of an issue that 
needs to be handled will be raised if negative user experiences of the related type 
get reported more often than allowed by the threshold. 


Issues of Known Types When SKYNET classifies a piece of user feedback text 
to a known type of negative user experience, we say the feedback is an instance 
of the user experience type. By concatenating the instance numbers of a known 
negative user experience type within each time unit, we form time-series data 
about the frequency of that type of user experience. Based on the hypothesis 
that a rising issue of known type will cause outliers in the time-series data of its 
corresponding label, SKYNET determines that there is an issue when the number 
of user feedback reporting a particular known type of negative experience in a
time period exceeds a threshold. 

Since the normal frequency of each type of negative user experience is closely 
related to several factors that vary across experience types and over time, adopt- 
ing a fixed threshold for all negative user experience types would be too rigid. 
First, different types of negative experiences naturally occur at different
frequencies. For example, in our experience, it is normal to have in each day a few
hundred users of a large-scale service system reporting that they cannot receive the
verification code, and the reasons often include things like typos in their phone 
numbers, unstable connections of their phones, and the low response speed of 
their network operators, none of which is indicative of issues in our systems. 
On the contrary, the daily number of users reporting problems with uploading 
files is typically much smaller, and when that number increases significantly, it 
is highly likely that an issue in our system is the cause. Second, the normal fre- 
quency of any type of negative user experience fluctuates at different times in a 
day, a week, or a month. For instance, most negative experiences occur more of- 
ten during the day when most users are active than at midnight when most users 
have fallen asleep. Since predicting a dynamic threshold with historical data is a 
widely accepted way to detect issues [33,21], SKYNET naturally formulates the
issue detection problem as a time series forecasting problem that predicts the 
normal frequency range for each label based on historical data. 

More concretely, we apply a sliding window strategy for the segmentation 
of each label’s historical data, and we adopt a classical bidirectional long short- 
term memory (BiLSTM) network to learn the historical trends of individual 
labels. The window size is set to 50 time units in the current implementation, and
the window slides with a stride length of one time unit. Note that all outliers
(data points outside the interquartile range [4]) in the time series are removed,
and Min-Max normalization [31,32] is applied for feature scaling before training.

Fig. 4: Expansion of frequency data with feedback type ID, which enables the prediction
of multiple thresholds with a unified BiLSTM model.
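
The segmentation of a label's frequency series into training samples can be
sketched as follows. This is only an illustration under stated assumptions: the
helper names are hypothetical, the outlier removal step is omitted, and each
sample pairs a window of 50 values with the value that immediately follows it.

```python
def min_max_normalize(series):
    """Scale values into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [(v - lo) / (hi - lo) for v in series]

def sliding_windows(series, window=50, stride=1):
    """Return (history, target) pairs: `window` past values and the next value."""
    samples = []
    for start in range(0, len(series) - window, stride):
        history = series[start:start + window]
        target = series[start + window]
        samples.append((history, target))
    return samples

# Toy series of 60 time units yields 10 training samples.
raw = list(range(60))
normalized = min_max_normalize(raw)
samples = sliding_windows(normalized, window=50, stride=1)
```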

BiLSTM is a recurrent neural network that takes historical time series data 
as input to make a prediction based on the trend. To predict a value $\hat{y}_t$
for time $t$, the model takes a series of historical data $[x_{t-50}, \ldots, x_{t-1}]$
as input, where $x_i$ represents the feature vector for time unit $i$. During
training, the model loss is the mean squared error between the actual value $y_t$
and the predicted value $\hat{y}_t$ for time $t$.

Based on the predicted frequency $\hat{y}_t$ for a label, SKYNET calculates the
threshold $th_t$ for the label as $\hat{y}_t \cdot dr$, where $dr$ is a dynamic ratio
calculated as
$\log(\mathrm{std}([x_{t-50}, \ldots, x_{t-1}]) / \mathrm{mean}([x_{t-50}, \ldots, x_{t-1}]))$.
The rationale behind the calculation of the threshold is that the magnitude of
acceptable frequency fluctuations should be proportional to the absolute value of
the frequency prediction for the label. For example, when the occurrence of a label
increases by ten, this fluctuation would be relatively smaller if the label’s
regular frequency $y_t$ is ten
thousand instead of a hundred. We apply a log transformation when calculating 
dr to keep it relatively small. 


Predicting Multiple Thresholds with A Unified BiLSTM Model Usually, predict- 
ing the normal frequency of a particular type of user feedback requires training 
a specialized model with the historical frequency data associated with that type. 
Training one specialized model for each prediction task, however, would cause 
high costs for the application and maintenance of SKYNET. To reduce those 
costs, we expand the values in the time series data for each type of user feed- 
back with the identity of that type and use the expanded time series data of all 
feedback types to train a unified BiLSTM model. The unified model is then able 
to predict the normal frequencies of different types of user feedback. 
Particularly, we expand the feedback frequency data in three steps, as depicted
in Figure 4. We first apply one-hot encoding to produce a unique value as
the identity of each type of user feedback. Since one-hot type IDs generated in
this way are typically sparse, we then transfer them to a dense vector via a
fully-connected network $g(\cdot)$. Afterward, the frequency data and the dense
vector will be combined to form the expanded frequency data. That is, given the
one-hot ID $\delta$ of a user feedback type and the vectorized frequency $x_t$ of
this user feedback type at time $t$, the expanded frequency is constructed as
$x_t \oplus g(\delta)$, where $\oplus$ indicates vector concatenation. Here, the
transfer of one-hot type IDs to dense vectors is necessary because, without it,
all but one dimension of the input data would be for the feedback type ID, and it
would be extremely hard for the BiLSTM model to learn meaningful knowledge about
the feedback frequency.
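
The three expansion steps can be sketched as below. The sketch is illustrative
only: the dense layer is a randomly initialized linear map standing in for the
fully-connected network, whereas in SKYNET its weights would be learned, and the
names `one_hot`, `make_dense_layer`, and `expand` are hypothetical.

```python
import random

def one_hot(type_index, num_types):
    """Step 1: one-hot ID of a feedback type."""
    return [1.0 if i == type_index else 0.0 for i in range(num_types)]

def make_dense_layer(num_types, dense_dim, seed=0):
    """Step 2: a stand-in for g(.), mapping a sparse ID to a dense vector."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(num_types)]
               for _ in range(dense_dim)]

    def g(delta):
        return [sum(w * d for w, d in zip(row, delta)) for row in weights]

    return g

def expand(x_t, delta, g):
    """Step 3: concatenate the frequency vector with the dense type vector."""
    return list(x_t) + g(delta)

g = make_dense_layer(num_types=5, dense_dim=3)
expanded = expand([0.42], one_hot(2, 5), g)  # 1 frequency value + 3 ID dimensions
```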

Evaluation results of SKYNET on three real-world large-scale online service
systems, as detailed in Section 4, show that such unification improves the
efficiency of threshold prediction in SKYNET without significantly sacrificing
its effectiveness.


Issues of Unknown Types Recall that all feedback reporting previously un- 
known types of negative user experiences will be classified into the “Unknown” 
category, and such feedback may also reveal issues if many of them concern 
similar experiences. In view of that, SKYNET clusters user feedback in category 
“Unknown” periodically (e.g., every half an hour) and raises an issue when the 
number of feedback in a cluster exceeds a threshold. Figure 5 depicts the main
steps SKYNET takes to detect issues of unknown types based on clustering. 

To increase the chance that user feedback reporting similar user experiences 
gets placed into one cluster, it is important that the embedding properly captures
the semantic characteristics of the feedback texts. To that end, SKYNET
naturally uses the fine-tuned ALBERT-Tiny model to generate the deep semantic
embedding of these feedback texts. Feedback clustering solely based on that
embedding, however, may suffer from the overfitting problem and miss issues 
of unknown types because the ALBERT-Tiny model was fine-tuned w.r.t. the 
input hierarchical label system. Therefore, SKYNET also incorporates the shallow
semantics extracted with Word2Vec [27,28] and Smooth Inverse Frequency
(SIF) [9] to facilitate the clustering. Word2Vec is a pre-trained model that learns
word associations from a large corpus of text, while SIF uses the vector
calculated as the weighted average of all word vectors to embed a sentence. Given
a piece of feedback text, SKYNET first applies Word2Vec to produce the em- 
bedding for each token in the text and then converts the token embeddings to a 
sentence embedding with SIF. Afterward, the overall embedding of the feedback 
combining its shallow and deep semantic information is formed by concatenating 
the embeddings produced by ALBERT-Tiny and SIF, respectively. 
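
The construction of the combined embedding can be sketched as follows. The
sketch makes several assumptions: word vectors and word probabilities are given
as plain dictionaries, the SIF weight of a word w is a / (a + p(w)), the common
component removal step of SIF is omitted, and the deep embedding is passed in as
a ready-made vector. All names are hypothetical.

```python
def sif_sentence_embedding(tokens, word_vectors, word_prob, a=1e-3):
    """Weighted average of word vectors with SIF-style weights a / (a + p(w))."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        if tok not in word_vectors:
            continue  # skip out-of-vocabulary tokens
        weight = a / (a + word_prob.get(tok, 0.0))
        for k, value in enumerate(word_vectors[tok]):
            total[k] += weight * value
        count += 1
    return [t / count for t in total] if count else total

def overall_embedding(deep_vec, shallow_vec):
    """Concatenate the deep (ALBERT-Tiny) and shallow (SIF) embeddings."""
    return list(deep_vec) + list(shallow_vec)

# Toy vocabulary: a rare word gets a larger weight than a frequent one.
vectors = {"upload": [1.0, 0.0], "the": [0.0, 1.0]}
probs = {"upload": 0.0, "the": 1e-3}
shallow = sif_sentence_embedding(["upload", "the"], vectors, probs)
combined = overall_embedding([0.1, 0.2, 0.3], shallow)
```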

Fig. 6: Cross-domain decision mechanism. The valid public opinion is used to retrieve
feedback according to both syntactic and semantic similarity from the database in a 
time window. The retrieved feedback results then go through a statistical judgment for 
issue alert. 




With the overall semantic embedding as input, SKYNET employs the K- 
means algorithm to cluster “Unknown” feedback into groups. Note that, since 
the “Unknown” user feedback usually concerns a wide range of user experiences 
without concentrating on any specific types, we expect the resultant clusters to 
be small in size. Correspondingly, when those user feedback texts form large 
groups, it is highly likely that the feedback in those groups reveals issues in the 
system. Specifically, SKYNET reports an issue if the size of a cluster exceeds
a threshold $H_u = \max(N_{total}/m \cdot \alpha, \beta)$, where $N_{total}$ is the
total number of feedback being clustered, $m$ is the (predefined) number of
clusters to produce, while both $\alpha$ and $\beta$ are constants. In other words,
an alert will be raised if the number of feedback in a cluster is larger than both
$\alpha$ times the average cluster size and a fixed value $\beta$. We conservatively
set $\alpha$ to 5 in SKYNET since, according to our experience, an issue often
causes the size of its corresponding feedback cluster to increase by 10 times or
even more. $\beta$ is introduced to avoid reporting issues merely because the value
of $N_{total}/m \cdot \alpha$ is very small, e.g., when the total number of user
feedback to be clustered is small, and we empirically set it to 10.
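
The alerting rule for clusters of "Unknown" feedback can be sketched as follows.
The function names are hypothetical, and the sketch assumes that the number of
clusters m equals the number of cluster sizes passed in.

```python
def cluster_alert_threshold(n_total, m, alpha=5, beta=10):
    """max(N_total / m * alpha, beta), with the constants reported in the text."""
    return max(n_total / m * alpha, beta)

def clusters_to_report(cluster_sizes, alpha=5, beta=10):
    """Indices of clusters whose size exceeds the threshold."""
    n_total = sum(cluster_sizes)
    m = len(cluster_sizes)
    threshold = cluster_alert_threshold(n_total, m, alpha, beta)
    return [idx for idx, size in enumerate(cluster_sizes) if size > threshold]

# 20 clusters, 200 feedback items: average size 10, so the threshold is 50.
sizes = [2] * 19 + [162]
reported = clusters_to_report(sizes)
```

With only ten feedback items in ten clusters, the first term is 5 and the fixed
value 10 takes over, so no small cluster triggers an alert.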


3.3 Issue Detection Based on Social Media Data 


Due to the potentially high cost and the impact that negative public opinions 
may cause when they are overlooked, SKYNET dedicates an auxiliary process to 
detecting issues reflected by posts on social media platforms. 

Compared with user feedback collected from dedicated channels that is more 
informative and has labeled historical data for training, social media posts usu- 
ally contain noisy data, are less structured, and often cover a wide range of 
topics, making it more challenging to extract issue-related information from 
them. In view of that, SKYNET adopts a two-stage denoising process to prune 
out most posts that are either not directly related to the service system under 
consideration or not reporting experiences likely associated with issues. 

More concretely, during the two-stage denoising process, SKYNET first applies
keyword-based search to filter out posts that do not mention the name of
the target service system, and then applies a binary classification model
constructed with ALBERT-Tiny to further filter out posts not reporting negative
user experiences. To train the classification model, we collected product-related
posts and manually labeled them to distinguish whether they report negative 
user experiences. We refer to all the social media posts that are retained after 
the two-stage denoising process as relevant posts. 

To identify social media posts that report negative experiences likely associ- 
ated with issues, SKYNET employs a cross-domain joint-decision-making process 
based on both user feedback and social media posts. As depicted in Figure 6,
for each relevant social media post, SKYNET first retrieves similar user feedback 
from past time windows. We consider two types of similarities between user feed- 
back and social media posts. The lexical similarity is calculated using the Lucene 
correlation algorithm that comes with ElasticSearch [8], which is based on the 
classic BM25 algorithm [8]. We consider a piece of user feedback to be a lexical 
match of a social media post if the BM25 score between them is higher than a 
threshold of 40. The semantic similarity is calculated as the Euclidean distance
between the ALBERT-Tiny embeddings of the user feedback and the social media
post. We consider a piece of user feedback to be a semantic match of a social 
media post if the distance is smaller than a threshold of 0.4. A piece of user feed- 
back is considered a match for a social media post if it is a lexical or semantic 
match for the post. Obviously, it is possible that a piece of user feedback is both 
a lexical and a semantic match of a social media post. 
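
The matching rule can be summarized in a short sketch. It assumes that the BM25
score has already been obtained from the search engine and that both embeddings
are available as plain vectors; `is_match` and its parameter names are
hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def is_match(bm25_score, feedback_vec, post_vec,
             lexical_min=40.0, semantic_max=0.4):
    """A feedback item matches a post if it is a lexical or a semantic match."""
    lexical = bm25_score > lexical_min
    semantic = euclidean(feedback_vec, post_vec) < semantic_max
    return lexical or semantic

lexical_only = is_match(45.0, [0.0, 0.0], [3.0, 4.0])   # high BM25, far apart
semantic_only = is_match(10.0, [0.0, 0.0], [0.3, 0.0])  # low BM25, close
no_match = is_match(10.0, [0.0, 0.0], [3.0, 4.0])
```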

Given a relevant social media post $p$, let $N_h$ and $N_d$ be the total number of
matching user feedback for $p$ in the past hour and day, respectively. SKYNET
raises an issue if $N_h$ exceeds the threshold $H_h = \max(\alpha_h \cdot \bar{N}_h, \beta_h)$
or $N_d$ exceeds the threshold $H_d = \max(\alpha_d \cdot \bar{N}_d, \beta_d)$, where
$\bar{N}_h$ and $\bar{N}_d$ are the average number of matching user feedback for $p$
in each hour and day of the past week, respectively, while $\alpha_h$, $\alpha_d$,
$\beta_h$, and $\beta_d$ are constants. Intuitively, an alert will be
generated if (1) the number of similar user feedback in the past hour is larger
than both $\alpha_h$ times the hourly average across the past week and a fixed value
$\beta_h$ or (2) the number of similar user feedback in the past day is larger than
both $\alpha_d$ times the daily average across the past week and a fixed value $\beta_d$.
We empirically assign 3, 3, 5, and 10 to $\alpha_h$, $\alpha_d$, $\beta_h$, and
$\beta_d$, respectively, in
the current implementation of SKYNET, and we leave the development of more 
sophisticated techniques for predicting the threshold values for future work. 
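
The hourly and daily checks can be sketched as follows, using the constants
reported above as defaults. The function name `should_alert` and its parameter
names are hypothetical.

```python
def should_alert(n_hour, n_day, avg_hour, avg_day,
                 alpha_h=3, alpha_d=3, beta_h=5, beta_d=10):
    """Alert if the hourly or the daily count of matches exceeds its threshold."""
    h_hour = max(alpha_h * avg_hour, beta_h)
    h_day = max(alpha_d * avg_day, beta_d)
    return n_hour > h_hour or n_day > h_day

# Weekly averages: 1 match per hour, 20 per day, so the thresholds are 5 and 60.
hourly_spike = should_alert(n_hour=6, n_day=0, avg_hour=1.0, avg_day=20.0)
quiet = should_alert(n_hour=5, n_day=30, avg_hour=1.0, avg_day=20.0)
daily_spike = should_alert(n_hour=2, n_day=61, avg_hour=1.0, avg_day=20.0)
```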


4 Experimental Evaluations 


We experimentally evaluated the effectiveness of SKYNET and the usefulness of 
its components based on its application results produced on real-world online 
service systems. Our evaluation aims to address the following research questions: 


RQ1: How effective is SKYNET in detecting issues in industry-level online service 
systems? In RQ1, we assess the effectiveness of SKYNET in issue detection 
in terms of the precision and recall it achieves from a user’s perspective. 

Table 1: Industry-level online service systems used as the subjects in our experiments.

ID  DESCRIPTION                       MAU     #FEEDBACK   LABEL
                                                          TOP  INTERM.  BOTTOM
S1  An online video sharing platform  > 600m  > 100,000   36   140      360
S2  An online video editing system    > 130m  > 1,000     13   188      442
S3  An online beauty camera platform  > 27m   > 200       7    51       84


RQ2: How useful are the individual component mechanisms of SKYNET for the 
overall issue detection? Recall that SKYNET integrates three components 
to effectively detect issues in large-scale online service systems, namely a 
component Ck that applies cascading classification and time series analysis
to detect issues of known types based on user feedback, a component Cu 
that applies the K-means clustering algorithm to detect issues of unknown 
types based on user feedback, and a component Cp that applies joint decision 
making to detect issues based on social media posts. In RQ2, we investigate 
how much each of these components contributes to the overall effectiveness 
of SKYNET. 


We were not able to experimentally compare SKYNET with iFeedback for 
two reasons. First, the implementation of iFeedback is not publicly available. 
Second, faithfully re-building the tool is hardly viable because important in- 
formation regarding its implementation is missing from the related publication. 
For example, we only know from the publication that iFeedback employs an 
XGBoost-based model to classify whether a time interval contains an issue, and 
it applies a hierarchical algorithm to cluster the user feedback as reporting dif- 
ferent issues [45], but no information about the settings and parameters of the 
model and algorithm adopted in their implementation was given in the publi- 
cation, although those settings and parameters may greatly affect iFeedback’s 
issue detection capabilities. 


4.1 Subject Systems 


In our experiments, we applied SKYNET to three industry-level online service 
systems. Table 1 summarizes the basic information about the systems. For each
system, the table gives its ID, a brief description, its number of monthly active 
users (MAUs) in millions, and the average number of user feedback items re- 
ceived per day for the system. System S1 is an online video-sharing social media 
platform, system S2 is an online video editing system, and system S3 is an online 
beauty camera platform. The subjects include systems of different types for dif- 
ferent users, with different magnitudes of MAUs, and receiving different amounts 
of user feedback. The diversity in the subject systems helps to ensure that the 
experiments are representative of SKYNET’s behavior in different situations.


4.2 Model Training


Since all three subject systems mainly target Chinese users, we configured SKYNET
to utilize a pre-trained ALBERT model [1], the DSG embedding corpora [7], and
the Jieba text segmentation library for processing texts in Chinese. Meanwhile,
we configured SKYNET to utilize the texts posted on Weibo³, one of the
biggest social media platforms in China, for issue detection in the experiments.

For each system, we utilized historical user feedback with labels manually 
assigned by the system developers over a one-month period to fine-tune the 
ALBERT-Tiny model and to train the cascading classification model as a whole. 
To prepare the hierarchical label system, first, we invited the system developers 
to decide which labels associated with negative user experience reporting feed- 
back should be retained as the bottom layer labels. Then, following the principles 
described in Section 3.1, the developers were asked to group and summarize the
bottom layer labels to form the intermediate and top layer labels. Finally, all the 
other labels indicating negative user experiences were converted to “Unknown”, 
and the remaining labels were converted to “Non-negative”. In this way, we pre- 
pared for each online service a hierarchical label system and a large number 
of user feedback associated with those labels. For each constructed hierarchical 
label system, Table 1 gives the numbers of labels at its three different layers.

Afterward, we followed the standard practice to tune the hyperparameters
to be used with the classification and BiLSTM models. Particularly, for
each service system, we selected via random search a group of 10 hyperparameters
that enables the classification model to correctly label the most historical
user feedback texts, and then we looked for values adjacent to these
hyperparameters via grid search that produced the highest number of correct
labels and used the values for the classification model in our experiments.
The BiLSTM model was trained through stochastic gradient descent [37] on
the time series data derived from the given historical feedback data. For example,
for the experiments on service system S1, the cascading classification model
used the following non-default hyperparameters: batch_size=24; dropout=0.1;
learning_rate=2e-5; warm_up_proportion=0.1; max_epoch=10, while the BiLSTM
model used the following non-default hyperparameters: dropout=0.1; max_epoch=50;
sequence_len=50; learning_rate=0.1; batch_size=24.


4.3 Experimental Setup 


We applied SKYNET to detect issues in each subject system based on historical 
data collected over a ten-month period of time. Each detected issue was checked 
manually by operators and developers of the systems to confirm whether it 
indicates a real problem that needs to be handled. Moreover, the operators and 
developers also assessed the severity of each issue based on the functionalities it 
may impact, the costs it may incur, and the extent to which users’ experience 
may be jeopardized. An issue is called a severe issue if its impact in at least one 
of those aspects is substantial. 


3 https://www.weibo.com

To answer RQ1, we collected all the issues reported by SKYNET for the subject
systems as well as the results of manual inspections on the issues. Following
the practice in previous work [45], we measure the effectiveness of SKYNET in
terms of the precision and recall of the issue detection results produced by the
tool. In particular, the precision is calculated as the percentage of real issues in
all the detected issues, i.e., $N_c^i / N_d^i$, where $N_c^i$ and $N_d^i$ are the
numbers of issues confirmed by developers and detected by SKYNET, respectively; the
recall is calculated as the ratio of detected severe issues to all the severe issues
recorded for the whole experiment period, i.e., $N_d^s / N_r^s$, where $N_d^s$ and
$N_r^s$ are the numbers of severe issues detected by SKYNET and recorded by
developers, respectively.
Note that the recall metric concerns only severe issues in the system because severe
issues will be reported eventually due to their high impact even if SKYNET fails 
to detect them, while there is no practical way for us to find out the exact total 
number of real issues in those systems. 

To answer RQ2, we ran SKYNET two more times on all the user feedback 
data and the social media posts to detect issues for the systems, the first time 
with component Cp being disabled and the second time with both components 
Cp and Cu being disabled. Then, we compared the issue detection results from 
the three runs in the number of issues detected as well as the precision and recall 
of the corresponding results. 


4.4 Experimental Results 


In this section, we report on the results produced in the experiments and answer 
the research questions. 


RQ1: Effectiveness Table 2 lists the basic information about the issue detection
results SKYNET produced on the systems. For each system, the table lists
its system ID, the numbers of issues detected by SKYNET and confirmed by 
developers, the numbers of severe issues detected by SKYNET and recorded by 
developers, and the precision (PREC) and recall (RECA) achieved accordingly. 

SKYNET detected 2790 issues in total, 2595 of which were manually confirmed
to be true issues, achieving a precision of 93.0%. As for severe issues, developers 
recorded in total 62 cases for the three systems in ten months, and 58 of them 
were detected by SKYNET, achieving a recall of 93.5%. In comparison, iFeed- 
back [45] was able to achieve 76.2% and 93.2% for precision and recall, respec- 
tively, in its evaluation. SKYNET managed to significantly outperform iFeedback 
in terms of precision while slightly improving the recall. Such results suggest that 
SKYNET is both effective and accurate in issue detection. 
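
The overall figures can be re-derived from the counts in Table 2 with a short
sketch. The helper names are hypothetical; a zero denominator, as for the severe
issues of S3, is reported as undefined rather than as a number.

```python
def precision(confirmed, detected):
    """Share of detected issues that developers confirmed as real."""
    return confirmed / detected if detected else None

def recall(severe_detected, severe_recorded):
    """Share of recorded severe issues that were detected."""
    return severe_detected / severe_recorded if severe_recorded else None

# Overall row of Table 2.
overall_precision = precision(2595, 2790)
overall_recall = recall(58, 62)
print(f"precision={overall_precision:.1%} recall={overall_recall:.1%}")

# S3 has no recorded severe issues, so its recall is undefined.
s3_recall = recall(0, 0)
```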

To understand why SKYNET missed some severe issues, we manually inspected
all four severe issues that were missed. Three of the four severe issues
were missed due to minor fluctuations in the number of associated user feedback. 
For instance, one severe issue that SKYNET missed occurred during AB-testing 
of a service system. Since only a small number of users were involved in the 
AB-test, while the issue seriously damaged the user experience of the system, 

Table 2: Issue detection results produced by SKYNET on the subject systems.

SID      ISSUE                 SEVERE ISSUE         PREC   RECA
         DETECTED  CONFIRMED   DETECTED  RECORDED
S1       2003      1895        51        54         94.6%  94.4%
S2       507       452         7         8          89.2%  87.5%
S3       280       248         0         0          88.6%  -
Overall  2790      2595        58        62         93.0%  93.5%


Table 3: Usefulness of SKYNET’s individual components for issue detection.

                Ck                                 Ck + Cu                            SKYNET (Ck + Cu + Cp)
SID      N_r^s  N_d^i  N_c^i  N_d^s  P      R      N_d^i  N_c^i  N_d^s  P      R      N_d^i  N_c^i  N_d^s  P      R
S1       54     1975   1870   28     94.7%  51.9%  1997   1889   45     94.6%  83.3%  2003   1895   51     94.6%  94.4%
S2       8      497    444    5      89.3%  62.5%  507    452    7      89.2%  87.5%  507    452    7      89.2%  87.5%
S3       0      277    246    0      88.8%  -      280    248    0      88.6%  -      280    248    0      88.6%  -
Overall  62     2749   2560   33     93.1%  53.2%  2784   2589   52     93.0%  83.9%  2790   2595   58     93.0%  93.5%


the total number of users affected was relatively small, compared with the num- 
ber of users that routinely access the service provided by the system. Hence, 
no alert was triggered. The severe issue could have been detected if SKYNET
predicted the threshold frequency of issue-reporting feedback texts as a ratio to
the total number of users with access to the relevant system feature. SKYNET 
missed the other severe issue of a previously unknown type due to the impre- 
cise clustering of feedback texts. Since various users’ descriptions of the issue 
were quite different, SKYNET’s unsupervised model was not able to group all 
the user feedback reporting the same issue into a cluster. This is not completely 
unexpected since, although we have considered both the lexical and semantic 
characteristics of feedback texts in their embedding, it is not a perfect solution 
yet. We plan to devise more powerful embedding and clustering techniques to 
facilitate the detection of issues of unknown types in the future. 


SKYNET was effective and accurate in detecting issues for large-scale online 
service systems. 93.0% of the issues detected by SKYNET reflect real problems 
that demand manual inspection. 93.5% of the severe issues recorded for the 
systems were detected by SKYNET. 


RQ2: Usefulness of Component Mechanisms Table 3 shows the results
produced by SKYNET with various components being disabled in issue detection. 
For each system identified by its SID, the table gives the issue detection results 
from using just component Ck, using both components Ck and Cu, and using 
all three components of SKYNET. In each setting, the table lists the numbers of 
issues detected by the tool ($N_d^i$) and confirmed by developers ($N_c^i$), the number
of severe issues detected by the tool ($N_d^s$), and the precision (P) and recall (R)
achieved accordingly. 

When Ck is the only component enabled, SKYNET was able to detect 2749
issues, among which 2560 were manually confirmed, and 33 severe issues for the
systems, achieving the overall precision and recall of 93.1% and 53.2%, respectively.
To put it in perspective, that is 98.7% (=2560/2595) of the real issues
and 56.9% (=33/58) of the severe issues the tool can ever detect with all its 
components being enabled. Such results clearly show that both cascade feed- 
back classification and dynamic threshold prediction of SKYNET were effective 
in detecting issues based on user feedback. Although the recall that Ck achieved 
in detecting severe issues is relatively low, it is understandable since many se- 
vere issues are of previously unknown types and hence beyond the detecting 
capability of Ck.


Component Cu helped capture 29 (=2589-2560) real issues and 19 (=52-33) 
severe issues that component Ck failed to detect, which caused the precision of 
the overall result to drop slightly to 93.0% but helped raise the recall of the over- 
all result to 83.9%. The drop in the result precision is understandable since Cu 
essentially detects issues of previously unknown types via unsupervised learning, 
and the precision of unsupervised learning results is relatively low in general. Compared
with a few false positives, i.e., reported issues that were manually ruled out as 
they were not real issues, the 19 severe issues detected by component Cu are sig- 
nificantly more important for the developers. Therefore, we believe component 
Cu is a valuable complement to component Ck. Note that only feedback items
that report negative user experiences of previously unknown types are processed 
by component Cu. 


The issue detection results produced by components Ck and Cu also enable
us to directly compare SKYNET and iFeedback’s issue detection capability solely
based on user feedback. As shown in Table 3, if only having access to user
feedback, or when component Cp is disabled, SKYNET was able to detect 2784 issues,
among which 2589 were confirmed to be real ones and 52 were considered severe. 
The precision and recall achieved are therefore 93.0% and 83.9%, respectively. 
Recall that the precision and recall iFeedback achieved were 76.2% and 93.2%, 
respectively. The differences suggest that SKYNET and iFeedback make different 
tradeoffs between issue detection precision and recall. iFeedback is more lenient
in reporting issues: on the one hand, many issues it reported turned out to
be false positives; on the other hand, it managed to detect more severe issues.
SKYNET is stricter in reporting issues: on the one hand, it reported fewer false
positives; on the other hand, it missed a few more severe issues.


SKYNET makes up for its relatively low recall in issue detection based on 
user feedback by taking into account also users’ posts on social media platforms. 
Although component Cp only detected 6 more real issues in our experiments, 
all of them turned out to be severe, and missing any of these issues may have 
caused great damage to the company. Therefore, although this component has 
only slightly improved the overall recall, we consider it to be a crucial and
indispensable part of SKYNET.
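
The contribution of each component can be re-derived from the overall row of
Table 3 with a small sketch. The dictionary layout and function names are
hypothetical; the counts are the ones reported in the table.

```python
RECORDED_SEVERE = 62

# Overall row of Table 3: (issues detected, issues confirmed, severe detected).
configs = {
    "Ck": (2749, 2560, 33),
    "Ck+Cu": (2784, 2589, 52),
    "Ck+Cu+Cp": (2790, 2595, 58),
}

def summarize(detected, confirmed, severe_detected,
              severe_recorded=RECORDED_SEVERE):
    """Precision and recall for one configuration."""
    return confirmed / detected, severe_detected / severe_recorded

def marginal_gains(configs):
    """Extra confirmed and severe issues gained by each added component."""
    names = list(configs)
    gains = {}
    for prev, curr in zip(names, names[1:]):
        gains[curr] = (configs[curr][1] - configs[prev][1],
                       configs[curr][2] - configs[prev][2])
    return gains

gains = marginal_gains(configs)
ck_precision, ck_recall = summarize(*configs["Ck"])
```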


All the three components Ck, Cu, and Cp are important for SKYNET to detect
(severe) issues in an effective and accurate manner. 




Threats to Validity In this section, we discuss possible threats to the validity
of our findings and show how we mitigate them. 

Construct validity. In our evaluation, a reported issue could be manually 
confirmed or rejected as a real or severe issue, but different people may provide 
different assessments. To mitigate this threat, we directly reused the independent 
issue assessment results from the developers of the service systems. 

Internal validity. SKYNET makes use of a number of parameters, including
the size of the sliding window for BiLSTM and the similarity threshold for match- 
ing social-media posts with user feedback texts. We set the parameters based on 
our experience in the current implementation of SKYNET. Experimental eval- 
uation conducted on three industry-level online service systems produced very 
promising results, suggesting the chosen parameter values are appropriate. Hav- 
ing said that, we are aware that different values for the parameters may influence 
SKYNET’s effectiveness, and therefore we plan to conduct more experiments in 
the future to systematically evaluate the possible influence. 

We were not able to experimentally compare SKYNET with iFeedback for
reasons stated at the beginning of Section 4. As a result, we compared the
two tools based on the results they produced on the subject systems in their
respective evaluations. For the comparison to be as fair as possible, we
evaluated SKYNET on service systems of similar scales from various categories of
applications. Moreover, the comparison was based on the common metrics precision
and recall, instead of measurements like the numbers of issues and severe issues
detected, which greatly depend on the experimental setup.

External validity. The subject service systems adopted in our experiments 
were real-world services of different scales and from different application do- 
mains. These characteristics help mitigate the risk that our evaluation overfits 
the subjects. In the future, we will continue monitoring the execution of
SKYNET on existing service systems and will deploy SKYNET on more service
systems. We see no intrinsic limitations that
would prevent SKYNET from working reliably on different online service systems.


5 Conclusions 


This paper presents the SKYNET technique and tool that utilize user data gath- 
ered from multiple channels to detect issues for large-scale online service systems. 
The technique has been applied to detect issues for three real-world online ser- 
vices based on historical data gathered over a ten-month period of time. The 
produced results suggest that SKYNET is both effective and accurate in detect- 
ing issues and severe issues for large-scale online service systems. 


6 Data Availability 


The SKYNET tool has been integrated into the production issue tracking system 
at the first author's company. For confidentiality reasons, neither the tool nor
the multi-channel user feedback can be made available for public download.
Abstract. An OS microkernel can be extended by implementing services
upon it. A service could introduce an object that references a kernel
object, and implement a group of functions that invoke the functions
for manipulating the kernel object. We consider the scenario where the 
microkernel has been verified with machine-checkable proofs, while the 
services remain to be verified. Moreover, the verification of the micro- 
kernel is not performed with the verification of subsequent extension in 
mind. We address the problem of how to build sufficiently on the ver- 
ification results for the microkernel, in achieving the verification of the 
services. Our methodology consists of enhancements to the verification 
framework for the microkernel, and the design of invariants for establish- 
ing the connection between the service-level objects and the kernel-level 
objects. Using the methodology, we have conducted a substantial formal 
verification of a group of services extending the inter-task communication 
functionalities of the preemptive microkernel μC/OS-II. Our verification
uncovers dormant bugs and provides a level of correctness assurance for 
the services that is above what is achievable through extensive testing. 


1 Introduction 


Microkernels provide the most fundamental functionalities of operating systems 
such as task management, inter-task communication, and interrupt handling. 
Microkernels are relatively small in size and simple in structure. Compared with 
monolithic kernels, errors in microkernel-based systems are more likely to occur 
outside of the kernel. Thus, these errors are less likely to crash the entire system. 
A preemptive microkernel allows a task to be interrupted at any point of execu- 
tion, as long as interrupts are enabled in the CPU. During interrupt handling, 
a higher-priority task can be switched to. This mechanism permits the timely 
processing of urgent workloads, increasing the responsiveness of the system. 
On the downside, the possibility of preemption results in a great number of
inter-dependencies between tasks. This adds to the difficulty in correctly design- 
ing and implementing the microkernel. Out of concern for correctness, substantial 
efforts have been dedicated to achieving the formal verification of preemptive mi- 
crokernels (e.g., [28]). These verification efforts lay a solid foundation for assuring 
the correctness of the software systems based on preemptive microkernels. 

Since a microkernel only provides the core functionalities in abstracting and 
managing system resources, the extension of the functionalities for a microkernel 
is often required in a given application scenario. The functionality of a kernel
object $O_{knl}$ can be extended in the following way. Firstly, a data structure
is introduced: an instance $O_{srv}$ of this data structure contains a reference
to $O_{knl}$, while maintaining some additional attributes. Secondly, the
operations that can be performed on $O_{srv}$ are implemented. In these
operations, checks and updates are performed on the additional attributes in
$O_{srv}$, and the operations for $O_{knl}$ are invoked to complete the checks
and updates on the internal attributes. The extension provides a service to the
user. We shall refer to $O_{srv}$ as a service object.

For instance, the mutexes in a microkernel might not support modes of oper- 
ations such as recursive and non-recursive modes. This feature can be introduced 
in an extension of the microkernel, providing a modes-aware mutex service to 
the user. Firstly, a service-level mutex object can be introduced. Secondly, the 
mode of a mutex can be tracked by an attribute of this service object. Thirdly, 
in an operation that tries to obtain a service-level mutex that the current task 
already owns, the attribute is checked before deciding whether to invoke the 
kernel function for obtaining the mutex or not. 
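
The three steps above can be sketched in C as follows. This is a minimal
illustration, not code from the verified services: the types kernel_mutex and
service_mutex, the stub kernel_mutex_lock, and the policy chosen for each mode
(count re-acquisitions in recursive mode, reject them in non-recursive mode)
are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel-level mutex; not a real kernel API. */
typedef struct { int owner; /* owning task id, -1 when free */ } kernel_mutex;

/* Stub kernel function: acquisition succeeds only if the mutex is free. */
static bool kernel_mutex_lock(kernel_mutex *km, int task) {
    if (km->owner != -1) return false;
    km->owner = task;
    return true;
}

typedef enum { MODE_NON_RECURSIVE, MODE_RECURSIVE } mutex_mode;

/* Service-level mutex: references a kernel mutex and adds attributes. */
typedef struct {
    kernel_mutex *km;  /* reference to the underlying kernel object */
    mutex_mode mode;   /* additional attribute tracked by the service */
    int depth;         /* recursion depth, maintained by the service */
} service_mutex;

/* Returns true if the acquisition by `task` is granted. */
static bool service_mutex_lock(service_mutex *sm, int task) {
    if (sm->km->owner == task) {
        /* The task already owns the mutex: the mode attribute decides,
           and the kernel function is not invoked (assumed policy). */
        if (sm->mode == MODE_RECURSIVE) { sm->depth++; return true; }
        return false;  /* non-recursive mode rejects re-acquisition */
    }
    if (!kernel_mutex_lock(sm->km, task)) return false;
    sm->depth = 1;
    return true;
}
```

In this sketch the kernel mutex stays unaware of modes; the mode lives only in
the service object, which mirrors the split between additional and internal
attributes described above.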


In safety-critical scenarios, the correctness of the services that extend the 
microkernel can be as important as the correctness of the microkernel itself. A 
reliable way to ensure the correctness of the services is formal verification. If the 
microkernel itself has been formally verified, the formal specifications and proofs 
for the functions of the microkernel could be used as a basis for this verification. 


The formal verification of the services can still be non-trivial. This is true es- 
pecially if the tasks executing the service functions (e.g., the function for obtain- 
ing a modes-aware mutex) can be preempted. In this case, it can be non-trivial 
even to ensure that a service object in use always references a corresponding ker- 
nel object that has been properly allocated and initialized. For the verification of 
the services, another problem is how to achieve good reuse of the specifications 
and proofs for the underlying microkernel. Moreover, if the proofs for the micro- 
kernel have been developed using a verification framework, it would be good to 
sufficiently leverage this verification framework, as opposed to requiring a great 
amount of modification to the verification framework. 

In this article, we address the aforementioned challenges in the formal verifi- 
cation of OS services (in the above sense) that extend a preemptive microkernel. 
Specifically, we consider the case where refinement verification has been per- 
formed for the microkernel, using a variant of concurrent separation logic [9] 
called CSL-R [28,27]. This is the program logic used in the first formal verifica- 
tion of a practical preemptive microkernel with machine-checkable proofs. 
Fig. 1: The connection between service objects and kernel objects


The main contributions of this article include: 


1. enhancements to the verification framework of CSL-R to support the com- 
positional specification of the functions implementing the OS services 

2. a design of invariants dependent on auxiliary variables for reasoning about 
the connection between service objects and their underlying kernel objects 

3. results obtained by applying the extended verification framework and the
invariants design to achieve the formal verification of inter-task
synchronization and communication services that extend the corresponding
functionalities of the preemptive microkernel μC/OS-II [3]


Specifically, the enhancements to the verification framework of CSL-R enable
the integration of the specifications for the kernel functions as components
for the specifications of service functions. The connection between the service 
objects and their underlying kernel objects is shown to satisfy structural prop- 
erties that are generic to the specific purposes and contents of the services. The 
verification of the inter-task synchronization and communication services is per- 
formed in an industrial verification project in the aerospace domain, while these 
services also constitute a module of a system to be more widely used in other 
safety-critical scenarios. We devise the specification of each service function and 
prove that the specification is refined by the code of the function. The develop- 
ment is performed in the Coq proof assistant [1]. This verification is a substantial 
effort, in which we have uncovered problems in extensively tested code. 


2 Challenges in Verifying an OS Service 


We assume a service object (e.g., a service-level task, semaphore, or message 
queue) is implemented as a struct in C. The service object obj contains a pointer, 
obj.ptr, to a potential kernel object of the underlying microkernel. The service 
object contains a number of attributes that are managed outside of the micro- 
kernel. Moreover, we assume that all the service objects of the same kind are 
organized in the array obj_arr. This array is illustrated in the upper part of Fig. 1. 

We consider a kernel object to be active, if the kernel object has been allocated 
and initialized. An active kernel object is expected to be in a consistent state. 
The set of active kernel objects is illustrated in the lower part of Fig. 1. 

A desired integrity requirement about the connection between the service 
objects and the underlying kernel objects is: 

Requirement 1 If a service object is fully created, then the service object
references a kernel object that is in a consistent state.


This requirement is reflected by the arrow without a cross over it in Fig. 1. If 
the requirement is not met, then an operation on a service object could trigger 
an operation on an inconsistent kernel object. Hence, the proper completion of 
the kernel operation with correct results cannot be guaranteed. 

Another desired integrity requirement about the connection between the ser- 
vice objects and the underlying kernel objects is: 


Requirement 2 Each kernel object is referenced by at most one service object. 


This requirement is reflected by the arrow with a cross over it in Fig. 1. If a kernel 
object can be referenced by two or more service objects, then it is difficult to 
guarantee that all these service objects are consistent with the kernel object. An 
operation on one of these service objects would update the service object and 
the kernel object consistently. But this update could break the consistency of 
another service object with the kernel object. 
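
To make the two requirements concrete, the following C sketch states them as
executable checks over a toy model of obj_arr. The `created` and `active`
flags are hypothetical bookkeeping added for illustration, and the special
Dummy reservation value discussed later is deliberately not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { bool active; } kernel_obj;   /* toy kernel object */
typedef struct {
    kernel_obj *ptr;   /* pointer to a potential kernel object */
    bool created;      /* hypothetical flag: service object fully created */
} service_obj;

/* Requirement 1: every fully created service object references an
   active (hence consistent) kernel object. */
static bool requirement1_holds(const service_obj arr[], size_t n) {
    for (size_t i = 0; i < n; i++)
        if (arr[i].created && (arr[i].ptr == NULL || !arr[i].ptr->active))
            return false;
    return true;
}

/* Requirement 2: no kernel object is referenced by two service objects. */
static bool requirement2_holds(const service_obj arr[], size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (arr[i].ptr != NULL && arr[i].ptr == arr[j].ptr)
                return false;
    return true;
}
```

Such run-time checks only inspect one state at a time; the verification
discussed in this article instead has to show that the properties hold in
every reachable state, including states reached through preemption.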

It can be nontrivial to ascertain the [...]

[...] array element that corresponds to an unused service object. Line 3 checks
if the return value of get_free_obj is a valid index for obj_arr. If not, then
the entries of obj_arr are used up, and the function service_obj_create
returns. Otherwise, obj_arr[idx].ptr gets the special value
Dummy at line 4. This value signals that the array entry obj_arr[idx] is reserved — 
it cannot be used by a different task attempting to create a service object. Then, 
the critical region is exited. Afterwards, the kernel function kernel_obj_create for 
creating a kernel object is invoked at line 5. Here, katt is the attribute value 
used to initialize the kernel object. The function returns the pointer to the ker- 
nel object that is allocated and initialized — NULL in case no kernel object can 
be allocated. This pointer is assigned to the kernel object pointer in the service 
object obj_arr[idx] at line 6. Then, it is checked whether the pointer is not NULL.
The function service_obj_create returns if the kernel object pointer is NULL. Oth- 
erwise, the data attributes of the created service object obj_arr[idx] are initialized 
at line 8. The index idx for this created service object is then returned. 
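
Since the code of Fig. 2 is not reproduced in the text, the following C sketch
reconstructs service_obj_create from the prose description only. The names
enter_critical, exit_critical, INVALID_IDX, OBJ_NUM and KOBJ_NUM, the int-typed
attributes, the stub kernel allocator, and the convention that an unused entry
holds a NULL pointer are assumptions made for illustration. The line numbers
mentioned in the prose refer to Fig. 2, not to this sketch.

```c
#include <stddef.h>

#define OBJ_NUM 3          /* assumed capacity of obj_arr */
#define KOBJ_NUM 2         /* assumed capacity of the stub kernel pool */
#define INVALID_IDX (-1)   /* assumed encoding of an invalid index */

typedef struct { int katt; } kernel_obj;                 /* toy kernel object */
typedef struct { kernel_obj *ptr; int satt; } service_obj;

static service_obj obj_arr[OBJ_NUM];   /* zero-initialized: all entries unused */
static kernel_obj dummy_obj;           /* its address plays the role of Dummy */
#define DUMMY (&dummy_obj)

/* Stubs standing in for the real critical-region primitives. */
static void enter_critical(void) {}
static void exit_critical(void) {}

/* Stub kernel allocator over a static pool; returns NULL when exhausted. */
static kernel_obj pool[KOBJ_NUM];
static int pool_used = 0;
static kernel_obj *kernel_obj_create(int katt) {
    if (pool_used >= KOBJ_NUM) return NULL;
    pool[pool_used].katt = katt;
    return &pool[pool_used++];
}

/* Index of an unused entry (ptr == NULL), or INVALID_IDX if none is left. */
static int get_free_obj(void) {
    for (int i = 0; i < OBJ_NUM; i++)
        if (obj_arr[i].ptr == NULL) return i;
    return INVALID_IDX;
}

int service_obj_create(int katt, int satt) {
    enter_critical();
    int idx = get_free_obj();
    if (idx == INVALID_IDX) {          /* entries of obj_arr are used up */
        exit_critical();
        return INVALID_IDX;
    }
    obj_arr[idx].ptr = DUMMY;          /* reserve the entry for this task */
    exit_critical();
    kernel_obj *p = kernel_obj_create(katt);
    obj_arr[idx].ptr = p;              /* NULL releases the reservation */
    if (p == NULL) return INVALID_IDX;
    obj_arr[idx].satt = satt;          /* initialize the data attribute */
    return idx;
}
```

Note that the kernel call happens outside the critical region, which is exactly
where a preempting task may run; the Dummy reservation is what keeps the entry
from being handed out twice in the meantime.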

If Requirement 1 is to be satisfied, the following condition related to the 
function service_obj_create in Fig. 2 should be met. 


Fig. 2: The function service_obj_create 

Condition 1 After the completion of the assignment p<-kernel_obj_create(katt),
the pointer p points to an active kernel object if p is not NULL. 


This condition guarantees that the pointer assigned to obj_arr[idx].ptr points to 
an active kernel object — thus a kernel object in a consistent state. This helps 
ensure that the service object obj_arr[idx] references a kernel object that is in a 
consistent state, once the service object is fully created. However, Condition 1 
might not hold, since the data located at the address returned by
kernel_obj_create could be modified by preemptive tasks. Hence, dedicated
reasoning is required to ascertain that the potential modification of data does
not break Condition 1.

If Requirement 2 is to be satisfied, the following condition should be met.


Condition 2 After the completion of the assignment p<-kernel_obj_create(katt), 
no service object already references the kernel object pointed to by p. 


If Condition 2 is not met, then the service object obj_arr[idx] could start to
reference the created kernel object, along with some other service object that 
originally referenced the same kernel object. It appears that the potential kernel 
object that is allocated in a call to kernel_obj_create must be free before the 
allocation. Given the code of service_obj_create, it is unlikely that a free kernel 
object would get referenced from a service object. However, the joint effects of all 
the functions supporting the creation, deletion, and use of the service object are 
more complicated than suggested by this observation. Hence, dedicated formal 
reasoning is required to ascertain the satisfaction of Condition 2. 

In the remainder of the article, we will discuss how to ascertain the satis- 
faction of Condition 1 and Condition 2, thereby ascertaining the satisfaction of 
Requirement 1 and Requirement 2, in a refinement verification of OS services. 
A key ingredient of our methodology is the formulation of invariant conditions 
dependent on auxiliary variables in a separation logic (see Section 5). 

Ultimately, the ability to show that Requirement 1 and Requirement 2 are 
fulfilled supports the formal verification of the service functions against their 
specifications. We will also discuss how to compose these specifications from 
the formal specifications of the underlying kernel functions (see Section 4). This 
enables the reuse of the specifications and proofs for the kernel functions, as 
previously developed in the formal verification of the microkernel. 


3 Refinement Verification of OS Microkernels 


To facilitate the understanding of our technical development, we briefly introduce 
the verification framework for the concurrent separation logic CSL-R [28, 27], as 
well as the formal verification of an OS microkernel using this framework. 


3.1 The Big Picture 


Through the refinement verification of an OS microkernel, a simulation is estab- 
lished between the execution of a concrete system and the execution of an ab- 
stract system. The concrete system consists of client programs, kernel functions, 
and interrupt handlers. The abstract system contains the same client programs
as the concrete system. In addition, the abstract system contains the
specifications for the kernel functions and the interrupt handlers. These
specifications are in the form of abstract programs, as opposed to concrete C or
assembly code.

Fig. 3: Execution of a microkernel and simulation by a specification

An example of the simulation between the concrete system and the abstract 
system is illustrated in Fig. 3. In this figure, the concrete system runs two tasks. 
Task 1 calls the kernel function f with the list vl of argument values. This 
function executes a series of steps in a critical region. Then, it needs to wait on 
an event for a given time period. Hence, it calls the function sched() to trigger re- 
scheduling. Suppose task 2 is scheduled for execution. After several steps taken 
by task 2, a tick interrupt arrives. The arrival of the interrupt is marked by a
dedicated symbol in Fig. 3. After the interrupt is handled, the system looks for
the highest-priority task
that is ready for execution. Suppose task 1 has become ready and it is executed 
for another time. Task 1 then finishes the kernel function f and returns to user 
code. In the aforementioned scenario, task 2 is preempted by task 1. 

The kernel function f is specified using the abstract program $\omega_f$ as given by

\[
\omega_f\ vl := \gamma_1(vl);\ \mathbf{sched};\ \gamma_2(vl)
\]

Here, $\gamma_1$ and $\gamma_2$ represent two atomic steps of execution. Each step
has vl as the list of input values. In addition, sched is a primitive for the
scheduling operation. Moreover, $\gamma_1$, sched, and $\gamma_2$ are sequentially
composed. We will give further details about the language in which $\omega_f\ vl$
is expressed in Section 3.2.

Part of the simulation between the concrete system and the abstract system 
is concerned with the simulation of the execution steps for the function f. The 
abstract statement $\omega_f\ vl$ is executed in the abstract system after the
function f is called with the list vl of arguments. The concrete execution steps
in the critical region are simulated by the atomic step $\gamma_1$. Furthermore,
the concrete execution steps for sched() are simulated by the execution step of
sched. In addition, the concrete execution steps taken by task 1 after it is
resumed are simulated by the atomic step $\gamma_2$. The simulation between the
concrete system and the abstract
system is required to preserve a global invariant. The global invariant is used to 
relate the states of the two systems — further details will be given in Section 3.3. 

The simulation of the concrete system by the abstract system is established 
by reasoning about each kernel function separately. This reasoning is performed 
using the rules of the CSL-R logic. For the kernel function f, the goal of the 
reasoning is to establish the correspondence between the concrete code of f and 
the abstract program $\omega_f$. The reasoning goes forward (in the sense of [16]) in
the concrete code of f, performing symbolic execution of the abstract statement
$\omega_f\ vl$ at appropriate points. Thus, the goal is turned into establishing the
correspondence between the remainders of f and the remainders of $\omega_f\ vl$,
i.e., the abstract statements $\gamma_1(vl);\ \mathbf{sched};\ \gamma_2(vl)$,
$\mathbf{sched};\ \gamma_2(vl)$, and $\gamma_2(vl)$.


3.2 The Specification of Kernel Functions 


As illustrated in Section 3.1, a kernel function is specified using a mathematical 
function $\omega$. This function maps each list vl of argument values to an abstract
statement s. This abstract statement is expressed using the values in vl. The 
syntax for abstract statements is given below. 

\[
\begin{array}{rcl}
s & ::= & \gamma(vl) \mid \mathbf{sched} \mid \mathbf{end}\ \hat{v} \mid s_1; s_2 \mid s_1 + s_2 \\
\hat{v} & ::= & \mathsf{Some}\ v \mid \mathsf{None}
\end{array}
\]
where $v \in Val$, $vl \in Val^{*}$, $\gamma \in Val^{*} \times AState \times Val' \times AState$


Here, $Val$ is the set of values, $Val^{*}$ is the set of value lists, and $Val'$ is
the set of optional values. An optional value is represented by the meta-variable
$\hat{v}$. Furthermore, $AState$ is the set of abstract states. In the atomic
operation $\gamma(vl)$, $\gamma$ relates the list vl of input values and an initial
abstract state to an optional output value and a resulting abstract state.
Furthermore, $\mathbf{end}\ \hat{v}$ signals the completion of execution for an
abstract statement. In addition, $s_1; s_2$ is a sequential composition. Lastly,
$s_1 + s_2$ is a nondeterministic choice.

An abstract state X € AState captures as mathematical objects the memory 
content that is relevant to the abstract programs of the kernel functions. For 
example, a C struct s with the members s.a and s.b in the memory can be 
abstractly represented as a pair (a,b) in the abstract state. Overall, an abstract 
state could contain the representations of typical kernel objects such as kernel- 
level tasks, semaphores, mutexes, and message queues. The formal semantics of 
the abstract statements is defined based on reads and updates of the abstract 
state. We omit the definition of this semantics here. 
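
For readers who prefer code to grammars, the syntax of abstract statements can
be mirrored by a tagged struct in C. This is only a sketch of the syntax: the
names stmt, stmt_kind and count_atomic are hypothetical, the argument list of an
atomic operation is left out, and the omitted semantics is not modelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* One constructor per production of the grammar for abstract statements. */
typedef enum { S_ATOMIC, S_SCHED, S_END, S_SEQ, S_CHOICE } stmt_kind;

typedef struct stmt {
    stmt_kind kind;
    const char *gamma;           /* S_ATOMIC: name of the atomic operation */
    bool has_value;              /* S_END: Some v (true) or None (false) */
    int value;                   /* S_END: the v in Some v */
    const struct stmt *s1, *s2;  /* S_SEQ and S_CHOICE: the sub-statements */
} stmt;

/* Counts the atomic operations that occur syntactically in a statement. */
static int count_atomic(const stmt *s) {
    switch (s->kind) {
    case S_ATOMIC: return 1;
    case S_SEQ:
    case S_CHOICE: return count_atomic(s->s1) + count_atomic(s->s2);
    default:       return 0;  /* sched and end contain no atomic operation */
    }
}
```

The abstract program from Section 3.1 (two atomic steps separated by sched)
then becomes a nested S_SEQ value whose count_atomic result is 2.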


3.3 Invariants and Fractional Permission 


In a concurrent separation logic, the well-formedness of global resources is ex- 
pressed using a global invariant. Examples of these global resources include the 
kernel data structures for tasks, synchronization objects, etc. In a concurrent 
separation logic that supports refinement verification, the global invariant I is 
interpreted over a concrete state and an abstract state. Thus, I can be used to
assert the well-formedness of the global resources in concrete and abstract
representations and the relation between the two. Hence, if the struct s mentioned
in Section 3.2 is global, then I can be used to assert the well-formedness of s
in the memory, the well-formedness of the tuple (a,b) in the abstract state, and
the fact that a and b properly represent the memory values of s.a and s.b.

In reasoning about a kernel function, the global invariant I can be asserted
to hold after entering a piece of code that has exclusive access to the global
resources (e.g., a critical region in which a task cannot be preempted). The
auxiliary information provided by this assertion of I can be used in the
subsequent reasoning. The well-formedness of the global resources may be
temporarily broken in the code, but it must be re-established at the point where
exclusive access to the global resources is given up. At this point (e.g., where
a critical region is exited), I must be shown to hold again. Intuitively, a
critical region consumes well-formed global resources and gives back well-formed
global resources again.

Fig. 4: The abstract statement for service_obj_create

Consider an auxiliary variable that represents the current program location 
for a task. If the global invariant is formulated to depend on such a variable, then 
the variable should be treated as a global resource. However, the variable is then 
modifiable at any point outside of a critical region, by another task that preempts 
the current one. Nonetheless, the current program location of a task should not 
be modifiable by a different task. This is where fractional permission [8] can be 
employed to facilitate verification using a concurrent separation logic. 

More concretely, an auxiliary variable x can be introduced for a task t, such
that t has 1/2 permission, and the global invariant has 1/2 permission, over x. A
task is allowed (by the program logic) to read a variable, as long as the task
has 1/2 permission over the variable. On the other hand, a task is allowed to
modify a variable, only if the task has full permission over the variable. Hence,
the task t is allowed to modify the variable x, when the other 1/2 permission over
x is obtained from the global invariant, e.g., in a critical region. The variable x
cannot be modified by any preemptive task t'. This is because t' is allowed to
obtain at most 1/2 permission over the variable from the global invariant.
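
The permission accounting in this paragraph can be replayed with a small C
model that counts permissions in halves. It is a toy bookkeeping sketch under
assumed names (var_perm, can_read, can_write), not an encoding of the program
logic itself.

```c
#include <stdbool.h>

/* Permissions over one auxiliary variable x, counted in halves. */
typedef struct {
    int owner_halves;      /* share held by the owning task t */
    int invariant_halves;  /* share held by the global invariant */
} var_perm;

static bool can_read(int halves)  { return halves >= 1; }  /* 1/2 suffices */
static bool can_write(int halves) { return halves == 2; }  /* full needed */

/* Outside a critical region a task only has its own share. */
static int halves_outside_critical(const var_perm *p, bool is_owner) {
    return is_owner ? p->owner_halves : 0;
}

/* Inside a critical region a task additionally obtains the share that
   the global invariant holds. */
static int halves_in_critical(const var_perm *p, bool is_owner) {
    return halves_outside_critical(p, is_owner) + p->invariant_halves;
}
```

With the split {1, 1}, the owner can read x anywhere and write it only inside a
critical region, while any other task reaches at most one half and therefore
can never write x.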


4 Compositional Specification of Service Functions 


4.1 Composing Service Specification from Kernel Specification 


To enable the refinement verification of the function service_obj_create in Fig. 2, 
the function should be specified using an abstract statement. This abstract
statement should reflect the following cases about the execution of service_obj_create.


1. the execution of service_obj_create could fail, in case there is no usable service
object in the system, or
2. service_obj_create could obtain an index $v_{idx}$ for a usable service object,
attempt kernel object creation as implemented in kernel_obj_create, obtain
the return value $v_{cre}$ from kernel_obj_create, and then proceed as follows:
(a) if $v_{cre}$ is the address of a newly allocated and initialized kernel object,
then service_obj_create sets the kernel object pointer in the $v_{idx}$-th service
object to $v_{cre}$, sets the data attribute in this service object to the given
attribute value $v_{satt}$, and returns the index value $v_{idx}$
(b) if $v_{cre}$ is NULL, then service_obj_create returns an invalid index value

We intend to formulate the abstract statement for service_obj_create using
the specification language presented in Section 3.2. A potential formulation is
given in Fig. 4. At the top level, this abstract statement is a nondeterministic
choice between the parts expressing the meaning of item 1 and item 2 above. The
meaning of item 1 is expressed using the atomic operation $\gamma_{ierr}$. The
meaning of item 2 is expressed with two sequential compositions. Here, the atomic
operation $\gamma_{iok}$ is used to express the operation of obtaining $v_{idx}$.
Furthermore, $\omega_{kcre}$ is the abstract program for kernel_obj_create. In
addition, the nondeterministic choice between $\gamma_{cerr}$ and $\gamma_{cok}$ is
used to express a choice between the sub-items 2(b) and 2(a) above. This
particular choice is deterministic because of the conditions about $v_{cre}$ as
expressed in 2(a) and 2(b). The correspondence between the informal expression of
the functional requirements for service_obj_create and the formal counterpart is
illustrated by the annotations in Fig. 4.


The specification of service_obj_create in Fig. 4 is composed using the abstract
program for kernel_obj_create. This compositional aspect enables the reuse of the
specification for the functions of the underlying microkernel. This reuse implies
that the formal proofs for these kernel functions (as developed in verifying the
microkernel) can also be reused. However, a technical problem was encountered
with specifications like the one in Fig. 4. The function service_obj_create has
two formal parameters (see Fig. 2). According to the CSL-R framework, if the
abstract program of the function service_obj_create is $\omega_{scre}$, then the
result of calling the function with the arguments $v_{katt}$ and $v_{satt}$ in the
abstract system is the abstract statement $\omega_{scre}\ [v_{katt}, v_{satt}]$. This
cannot be the abstract statement in Fig. 4, because the additional parameters
$v_{idx}$ and $v_{cre}$ are not introduced.


To solve the aforementioned problem, we modify the semantics of the
specification language such that a call to a function could nondeterministically
result in an abstract statement $\omega\ (vl \mathbin{+\!+} vl')$, where $\omega$ is the
mathematical function representing the abstract program for the callee, vl is a
list that contains exactly the actual arguments for the callee, and vl' is an
arbitrary list of values. Intuitively, the list vl' can be used to accommodate
the intermediate values generated in the abstract program. For the above example
with service_obj_create, we define $\omega_{scre}$ such that
$\omega_{scre}\ ([v_{katt}, v_{satt}] \mathbin{+\!+} vl')$ yields the abstract statement
in Fig. 4. We use the first value of vl' for $v_{idx}$, and use the second value
of vl' for $v_{cre}$.


With this abstract statement, we intend to express that the atomic operation
$\gamma_{iok}$ identifies a specific index $v_{idx}$: the $v_{idx}$-th service object is
unused in the abstract state from which the operation is performed. Afterwards,
the atomic operation $\gamma_{cok}$ initializes exactly the $v_{idx}$-th service object.
However, $v_{idx}$ is arbitrary if it is the first value of the arbitrary list vl'.
How can we ensure that $v_{idx}$ is the index found by $\gamma_{iok}$ at the point where
the operation $\gamma_{cok}$ is performed?


We solve this problem by permitting the execution of an abstract statement
to reach an error state. From the error state no further execution of the abstract
statement is permitted. We adjust the refinement condition to express that the
concrete system should be simulated by the abstract system unless the abstract
system is in an error state. In the abstract program for service_obj_create, we
define the atomic operation $\gamma_{iok}$ such that an error state results if the
parameter $v_{idx}$ is not equal to the found index (see Fig. 5). Hence, if
$\gamma_{cok}$ is executed to simulate the concrete execution of service_obj_create,
the previous execution of $\gamma_{iok}$ could not have ended up in an error state.
Thus, $v_{idx}$ as used in $\gamma_{cok}$ is equal to the index of the unused service
object found by $\gamma_{iok}$.

Fig. 5: Simulation for service_obj_create in the extended verification framework
(potential preemption before/after atomic operations omitted)
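
The role of the error state can be illustrated with a C sketch of an operation
in the spirit of the index-finding atomic operation. The abstract state is
reduced to a `used` bitmap, the rule that the first unused entry is the one
found is an assumption, and the names abs_state, abs_status and gamma_iok are
hypothetical.

```c
#include <stdbool.h>

#define OBJ_NUM 3   /* assumed number of service objects */

typedef enum { ABS_OK, ABS_ERROR, ABS_NOT_ENABLED } abs_status;

/* Toy abstract state: which service objects are in use. */
typedef struct { bool used[OBJ_NUM]; } abs_state;

/* Sketch of the index-finding operation: v_idx is a guessed parameter taken
   from the arbitrary list vl'. A wrong guess leads to the error state, so any
   later step can rely on v_idx being the index that was actually found. */
static abs_status gamma_iok(abs_state *st, int v_idx) {
    for (int i = 0; i < OBJ_NUM; i++) {
        if (!st->used[i]) {
            if (i != v_idx) return ABS_ERROR;  /* guess differs from found index */
            st->used[i] = true;                /* reserve the found entry */
            return ABS_OK;
        }
    }
    return ABS_NOT_ENABLED;  /* no unused entry: the failure branch applies */
}
```

Only the run in which the guess matches continues past this operation, which is
how an arbitrary extra parameter ends up carrying a determined value.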

By admitting the error states in the abstract computation, and extending the 
notion of refinement in CSL-R correspondingly, we permit using the output of 
operations in the subsequent abstract computation. In particular, this enables 
the compositional specification of the service functions — where the abstract 
programs of the kernel functions may produce results that are used in the ab- 
stract programs of the service functions. For sound reasoning about the new 
notion of refinement, we have also introduced new rules into the program logic. 
Formally, we have re-established the soundness of the verification framework. 


Remark 1. In the μC/OS-II microkernel, the computation result of a critical
region is rarely passed to another critical region via local variables or return
values of functions. Correspondingly, it is unnecessary to capture the output value
of an operation and pass this value to another operation in the abstract program of
a function. Hence, the CSL-R framework for the verification of μC/OS-II was not
originally designed to accommodate additional parameters like $v_{idx}$ and $v_{cre}$.


4.2 Expressing Assumptions about the User 


A second use of the error states in the abstract computation (as discussed in 
Section 4.1) is to support the expression of assumptions about user data in the 
formal specification of the service functions. 

For an example of these assumptions, consider a variant of the service function
service_obj_create in Fig. 2 that works properly only if the argument satt
satisfies a well-formedness condition. More concretely, suppose satt is intended 
to be a pointer to a struct. This struct contains several attributes for initializ- 
ing the service object. However, the C language does not provide a feature to 
check whether satt really points to a well-formed struct that contains these at- 
tributes (like instanceof in Java). Hence, this check might not be implemented 
in the code of this variant of service_obj_create. Then, service_obj_create should 
be verified under the assumption that satt points to the right type of struct. 

The above assumption can be naturally expressed in the pre-condition for a 
function, if the function is to be verified using an ordinary Hoare-style program 
logic. However, a service function is specified using an abstract program instead 
of pre/post-conditions in a refinement verification. Then, the assumption should 
be expressed in this abstract program. We express such an assumption in the
definition of an atomic operation in the abstract program. More concretely, this 
atomic operation gives the error state if the assumed condition about user data 
is not satisfied. With our adjusted definition of simulation, the abstract system 
is required to simulate the concrete system only if the abstract system is not in 
an error state (see Section 4.1). This corresponds to the meaning of assumptions 
— the refinement of the abstract programs by the concrete code is only required 
if the assumptions about user data are satisfied. 
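
As an illustration, the following Python sketch models an abstract atomic operation that yields the error state when an assumption about user data fails. It is a sketch under assumptions, not the mechanized definition: Attr, well_formed, abs_set_attr, refinement_required and the dictionary state are hypothetical names introduced here.

```python
from dataclasses import dataclass

ERROR = object()  # hypothetical marker for the abstract error state


@dataclass(frozen=True)
class Attr:
    """Hypothetical stand-in for the struct that satt should point to."""
    kind: str


def well_formed(satt) -> bool:
    # The assumed condition about user data (here: satt is an Attr).
    return isinstance(satt, Attr)


def abs_set_attr(state: dict, idx: int, satt):
    """Abstract atomic operation: gives ERROR if the assumption fails."""
    if not well_formed(satt):
        return ERROR
    new_state = dict(state)
    new_state[idx] = satt
    return new_state


def refinement_required(abstract_result) -> bool:
    # Simulation is only demanded while the abstract side is not in error.
    return abstract_result is not ERROR
```

Calling abs_set_attr with an ill-formed argument lands in ERROR, after which refinement_required returns False; this mirrors how an unmet assumption discharges the refinement obligation.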


5 Reasoning about Service-Kernel Connection 


Through refinement verification of an OS service, we establish the simulation 
between the execution of the service functions and the execution of their abstract 
programs (see Section 4.1). This simulation preserves the global invariant. 

We express Requirement 1 and Requirement 2 (see Section 2) in the global 
invariant to show that the satisfaction of both requirements is preserved in the 
simulation. As explained in Section 2, establishing Condition 1 and
Condition 2 helps to show that Requirement 1 and Requirement 2 are fulfilled.
The two conditions can be established if they are also formulated in
the global invariant, and are shown to be preserved in the simulation. However, 
these two conditions involve the program location that is local to a task, as well 
as a task-local pointer to a kernel object. These parameters cannot be directly 
expressed in the global invariant. In this section, we explain how to capture the 
program location and the kernel object pointer for each task using auxiliary 
variables with fractional permission (Section 5.2). We then present a design of 
invariant conditions that depends on these auxiliary variables (Section 5.3). We 
are able to show that Condition 1 and Condition 2 are preserved by the execution 
of each service function, with the help of the invariant conditions. 

The satisfaction of Condition 1 and Condition 2 depends on the way each 
service function affects the connection between a service object and its underly- 
ing kernel object. Hence, we will first present a series of code patterns for service 
functions that capture a proper way to handle this connection (Section 5.1). 


5.1 Creation, Deletion, and Use of Service Objects 


We assume that the service functions for creating, deleting, and using a ser- 
vice object possess the code patterns in Fig. 6. The scope of critical regions 
is represented by the dashed boxes. A line with the content Check cond repre- 
sents a conditional that checks the condition cond. A return from the function is 
triggered if the check fails. Before each return from inside a critical region, the 
critical region is exited first. A line in the non-bold face represents an assignment 
to an auxiliary variable. These assignments will be explained later. 


Creation of Service Objects. The function service_obj_create is used to create 
a service object. The code pattern of this function is shown in Fig. 6a. This code 
pattern is the same as in Fig. 2, except for containing two extra assignments 
to auxiliary variables. In addition, the code pattern for the underlying kernel 
function kernel_obj_create is given in the upper part of Fig. 6b. 
(a) creation of service objects

    service_obj_create(katt, satt):
        local idx, p
        +- - - - - - - - - - - - - - - - - - -+
        | idx <- get_free_obj(obj_arr)
        | Check valid(idx)
        | obj_arr[idx].ptr <- Dummy
        +- - - - - - - - - - - - - - - - - - -+
        p <- kernel_obj_create(katt)
        +- - - - - - - - - - - - - - - - - - -+
        | obj_arr[idx].ptr <- p
        | Check p != NULL
        | obj_arr[idx].att <- satt
        | CurTask->_loc <- _Loc_normal
        | CurTask->_ptr <- NULL
        +- - - - - - - - - - - - - - - - - - -+
        return idx

(b) kernel obj. creation/deletion

    kernel_obj_create(katt):
        local p
        +- - - - - - - - - - - - - - - - - - -+
        | // p <- allocate kernel obj.
        | // initialize kernel obj.
        | CurTask->_loc <- _Loc_cre
        | CurTask->_ptr <- p
        +- - - - - - - - - - - - - - - - - - -+
        return p

    kernel_obj_delete(p):
        +- - - - - - - - - - - - - - - - - - -+
        | // delete kernel obj. at p
        | CurTask->_loc <- _Loc_normal
        | CurTask->_ptr <- NULL
        +- - - - - - - - - - - - - - - - - - -+
        return

(c) use of service objects

    service_obj_oper(idx):
        local p, err
        +- - - - - - - - - - - - - - - - - - -+
        | Check valid(idx)
        | Check cond(obj_arr[idx].att)
        | p <- obj_arr[idx].ptr
        | Check p != NULL && p != Dummy
        | CurTask->_ptr <- p
        +- - - - - - - - - - - - - - - - - - -+
        err <- kernel_obj_oper(p)
        +- - - - - - - - - - - - - - - - - - -+
        | CurTask->_ptr <- NULL
        +- - - - - - - - - - - - - - - - - - -+
        return err

(d) deletion of service objects

    service_obj_delete(idx):
        local p
        +- - - - - - - - - - - - - - - - - - -+
        | Check valid(idx)
        | p <- obj_arr[idx].ptr
        | Check p != NULL && p != Dummy
        | obj_arr[idx].ptr <- NULL
        | CurTask->_loc <- _Loc_del
        | CurTask->_ptr <- p
        +- - - - - - - - - - - - - - - - - - -+
        kernel_obj_delete(p)
        return


Fig. 6: The patterns for creation/deletion/use of service/kernel objects


Deletion of Service Objects. The function service_obj_delete (Fig. 6d) is used
to delete a service object. The deleted service object is the one represented by
the array element obj_arr[idx]. Here, idx is the argument of the function. The
function first checks to ensure that idx is within the array bound for obj_arr.
Then, the function remembers the kernel object pointer obj_arr[idx].ptr in the
local variable p. Afterwards, the function checks if the pointer p is neither NULL
nor Dummy. If so, then obj_arr[idx] should represent a valid service object. The
function then sets obj_arr[idx].ptr to NULL. Finally, the function invokes the kernel 
function kernel_obj_delete (Fig. 6b) to free the kernel object pointed to by p. 


Use of Service Objects. The function service_obj_oper (Fig. 6c) outlines the 
general pattern for an operation on a service object. First, the validity of the 
index for the target service object is checked. Then, it is checked whether the 
attribute value of the service object satisfies the conditions for performing the 
intended operation. Next, it is checked whether the pointer to the kernel object 
obj_arr[idx].ptr is valid. If so, the kernel function kernel_obj_oper performing the 
corresponding operation on the underlying kernel object is invoked. 
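
The creation and deletion patterns can be exercised with a small sequential Python model. It is only a sketch: preemption and critical regions are not modelled, kernel objects are addresses in a dictionary, allocation never fails, and the names System, Task, kernel_objs and next_addr are hypothetical.

```python
from dataclasses import dataclass, field

DUMMY = "Dummy"          # placeholder pointer that reserves a slot
LOC_NORMAL, LOC_CRE, LOC_DEL = "_Loc_normal", "_Loc_cre", "_Loc_del"


@dataclass
class Task:
    loc: str = LOC_NORMAL   # auxiliary variable _loc
    ptr: object = None      # auxiliary variable _ptr


@dataclass
class System:
    obj_arr: list = field(
        default_factory=lambda: [{"ptr": None, "att": None} for _ in range(2)])
    kernel_objs: dict = field(default_factory=dict)  # address -> attributes
    next_addr: int = 1


def kernel_obj_create(st, task, katt):
    p = st.next_addr                  # allocate and initialize a kernel object
    st.next_addr += 1
    st.kernel_objs[p] = katt
    task.loc, task.ptr = LOC_CRE, p   # auxiliary assignments
    return p


def kernel_obj_delete(st, task, p):
    del st.kernel_objs[p]             # delete the kernel object at p
    task.loc, task.ptr = LOC_NORMAL, None


def service_obj_create(st, task, katt, satt):
    free = [i for i, o in enumerate(st.obj_arr) if o["ptr"] is None]
    if not free:                      # Check valid(idx)
        return None
    idx = free[0]
    st.obj_arr[idx]["ptr"] = DUMMY    # reserve the slot
    p = kernel_obj_create(st, task, katt)
    st.obj_arr[idx]["ptr"] = p
    if p is None:                     # Check p != NULL
        return None
    st.obj_arr[idx]["att"] = satt
    task.loc, task.ptr = LOC_NORMAL, None
    return idx


def service_obj_delete(st, task, idx):
    if not 0 <= idx < len(st.obj_arr):   # Check valid(idx)
        return
    p = st.obj_arr[idx]["ptr"]
    if p is None or p == DUMMY:          # Check p != NULL && p != Dummy
        return
    st.obj_arr[idx]["ptr"] = None        # unlink before freeing
    task.loc, task.ptr = LOC_DEL, p
    kernel_obj_delete(st, task, p)
```

Creating an object and deleting it again returns both the slot and the task's auxiliary variables to their initial values, which is the behaviour the patterns are meant to guarantee in the absence of interference.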
5.2 Auxiliary Variables with Fractional Permission


We introduce an auxiliary variable, _ptr, for each task. This auxiliary variable 
reflects the value of the local pointer p at key program locations in the func- 
tions of Fig. 6. We employ fractional permission for _ptr. Half of the permission 
over _ptr is given to the global invariant. Hence, _ptr can be read in the global
invariant. Half of the permission over _ptr is retained by the task for which _ptr
is introduced. Hence, _ptr can be used to reflect the value of a local pointer. 

Via built-in mechanisms of CSL-R, we ensure that whenever a task enters 
a service function, the value of _ptr is NULL. This captures that the task is not 
working with a kernel object when entering a service function. When the task 
running a service function gets hold of a kernel object via p, we set _ptr of the 
task to the value of p. For service_obj_create, this is at the end of the critical 
region in the underlying kernel function kernel_obj_create — when the kernel 
object has just been created. For service_obj_delete and service_obj_oper, this is at 
the end of their first critical regions. We reset _ptr to NULL when the task loses 
hold of the kernel object. For service_obj_delete, this is at the end of the critical 
region in the kernel function kernel_obj_delete — when the kernel object has just 
been freed. For service_obj_create and service_obj_oper, this is at their end. 

We introduce an auxiliary variable, _loc, for each task. This auxiliary variable 
reflects the current program location of the task. We employ fractional permis- 
sion for _loc. Half of the permission over _loc is given to the global invariant. 
Hence, this variable can be read in the global invariant. Half of the permission 
over _loc is retained by the task for which _loc is introduced. Hence, the program 
location of each task cannot be modified by a different task. 

Via built-in mechanisms of CSL-R, we ensure that whenever a task enters a 
service function, the value of _loc is _Loc_normal. This reflects that the task is 
not at a special program location concerning object creation or deletion when 
entering a service function. When a task running a service function starts to 
work with a kernel object, we distinguish between the cases for object creation 
and object deletion, by setting _loc to different values. We set _loc to _Loc_cre
for object creation (see Fig. 6b). We set _loc to _Loc_del for object deletion (see
Fig. 6d). We reset _loc to _Loc_normal when the task stops working with the
underlying kernel object. If the service function executed is service_obj_oper, then
_loc remains at the value _Loc_normal through the execution of the function.
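
The effect of splitting the permission in half can be illustrated with a generic Python model of fractional permissions. This is a sketch of the general idea only, not of CSL-R's mechanization: the class AuxVar and its methods are hypothetical, and it is assumed that a task inside a critical region holds the invariant's share together with its own.

```python
from fractions import Fraction


class AuxVar:
    """A ghost cell whose permission is split into fractions (generic model)."""

    def __init__(self, value):
        self.value = value

    def read(self, perm: Fraction):
        if perm <= 0:
            raise PermissionError("reading needs a non-zero share")
        return self.value

    def write(self, perm: Fraction, value):
        if perm != 1:
            raise PermissionError("writing needs the full permission")
        self.value = value


HALF = Fraction(1, 2)
task_share, invariant_share = HALF, HALF   # the split described above

ptr = AuxVar(None)
# Assumption: inside a critical region the task also holds the invariant's share.
ptr.write(task_share + invariant_share, 0x1000)
seen_by_invariant = ptr.read(invariant_share)
```

With only its own half, a task can read the cell but cannot change it, and no other task can change it either, since no other task ever holds the owner's half.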


5.3 Invariant Conditions Dependent on Auxiliary Variables 


Via the auxiliary variables, _loc and _ptr, we are able to formalize Condition 1
and Condition 2. The formulation of these conditions is simpler if the abstract 
representations of data are used instead of the concrete counterpart. We use 
locmp to represent a function from each task identifier to an optional value 
of the auxiliary variable _loc for the task. We use ptrmp to represent a function 
from each task identifier to an optional value of the auxiliary variable _ptr for the 
task. We also introduce the abstract representations of the service objects and the 
kernel objects. We use sobjmp to represent a function that maps each index value 
i to an optional tuple. The tuple represents the service object obj_arr[idx] if idx 
sobj_kobj_aux(locmp, ptrmp, sobjmp, kobjmp, fkobjs) :=

where obj_ref(sobjmp, a) := ∃ i, att : sobjmp(i) = Some (KObj a, att)

and ptr_in_fkobj_pool(a, fkobjs) means a is the address of some free kernel object


cre_del_mut_ex(locmp, ptrmp) :=
(locmp(t2) ∈ {_Loc_cre, _Loc_del} ∧ ptrmp(t2) = Some (Vptr a)) ⇒
t1 = t2


Fig. 7: The invariant conditions sobj_kobj_aux and cre_del_mut_ex


has the value i. More concretely, we have sobjmp(i) = Some (KObj a, att) if the
value of obj_arr[idx].ptr is a, and the value of obj_arr[idx].att is att. Furthermore,
we use kobjmp to represent a function that maps the address of each active kernel
object to the abstract representation of the kernel object. Hence, the expression
kobjmp(a) ≠ None means that there is an active kernel object at the address a.

We devise the condition sobj_kobj_aux(locmp, ptrmp, sobjmp, kobjmp, fkobjs) 
as shown in Fig. 7. We make this condition a part of the global invariant. Ac- 
cording to this condition, if a task with the identifier t is working with the kernel 
object at the address a (i.e., ptrmp(t) = Some (Vptr a)), then the task could be 
at a special program location for object creation, at a special program location 
for object deletion, or not at one of these special program locations. These three 
cases are reflected by a disjunctive normal form in sobj_kobj_aux.


The Use of the Invariant Condition sobj_kobj_aux. The invariant con- 
dition sobj_kobj_aux becomes available to the reasoning task after each critical 
region is entered. The contents of the parameters locmp, ptrmp, sobjmp, kobjmp, 
and fkobjs correspond to the concrete data they represent. The specific parts @- 
@ can be exploited depending on the values of the auxiliary variables. 

We are able to capture Condition 1 and Condition 2 in Section 2 using 
sobj_kobj_aux. If a task t has just completed the assignment p<-kernel_obj_create(katt)
in the function service_obj_create, then the task is at a special program
location for object creation (i.e., locmp(t) = Some _Loc_cre). Hence, Condition 1
in Section 2 is captured by the condition @ in Fig. 7. Furthermore, Condition 2 
in Section 2 is captured by the condition @) in Fig. 7. Condition @) is expressed
using the predicate obj_ref. The definition of this predicate is given below the 
definition of sobj_kobj_aux in the upper part of Fig. 7. 

We next explain the use of the condition @. When a task is in the function 
kernel_obj_delete (hence at _Loc_del), the task resets the members of the kernel 
object pointed to by p to their initial values. Condition @ says that p points 
to an active kernel object. This helps ensure the safety of the dereferencing 
operation on p. The condition ®V@ serves an analogous purpose. When a task 
is in the function kernel_obj_oper (hence at _Loc_normal), the task dereferences
the pointer p to access the members of the kernel object. The condition O®V@ 
says that p points to a kernel object that is either active or in the pool of the 
free kernel objects. Thus, the safety of the dereferencing operation is ensured. 
Here, the disjunction of © with @ is necessary. This is because before the task 
enters kernel_obj_oper, the task can be preempted by another task. The latter 
task could invoke service_obj_delete, obtain the pointer to the kernel object, and 
free the kernel object in kernel_obj_delete. This deletion does not cause trouble 
to the execution of kernel_obj_oper — a sensible design of kernel_obj_oper would 
check whether the kernel object to be used has been freed. This check can be 
implemented using a data member of kernel objects. 


The Proof Obligations for sobj_kobj_aux. Since sobj_kobj_aux is specified
as a part of the global invariant, a proof obligation in the verification of the
service functions is to establish sobj_kobj_aux where a critical region is exited.
Further invariant conditions are supplied for fulfilling this proof obligation. 
Suppose a task with identifier t is about to return to the service func- 
tion service_obj_create from the kernel function kernel_obj_create. There, we have 
locmp(t) = Some _Loc_cre. In addition, if the local pointer p has the value a, 
then we have ptrmp(t) = Some (Vptr a). Hence, condition @ in sobj_kobj_aux 
requires that there be an active kernel object at the address a. Consider a poten- 
tial case where the task t is preempted by a different task t’, which happens to be 
entering the function kernel_obj_delete, with the address a as the value for the pa- 
rameter p. At the point where t’ exits from the critical region in kernel_obj_delete, 
condition @ cannot be established for t. This is because the kernel object at a
would have been freed by the task t’, so this kernel object is no longer active.
To show that the aforementioned scenario involving the tasks t and t’ is 
impossible, we introduce another condition, cre_del_mut_ex, into the global in- 
variant (see bottom part of Fig. 7). The condition says that the actual accesses 
of the special program locations marked by _Loc_cre and _Loc_del are mutually
exclusive, among all the accessing tasks that deal with the same kernel object 
at some address a. Consider the point where task t’ enters the critical region 
in kernel_obj_delete. The task is then at the program location _Loc_del. If task 
t is about to return from kernel_obj_create, the task is at the program location 
_Loc_cre. Hence, the kernel object dealt with by t cannot be the kernel object
that is dealt with by t’, according to the invariant condition cre_del_mut_ex.
While task t’ is in the critical region of kernel_obj_delete, no other task can exe-
cute. Hence, the kernel object dealt with by t cannot be the kernel object dealt
with (deleted) by t’, when task t’ exits the critical region of kernel_obj_delete. 
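
Read as a predicate over the abstract maps, the mutual exclusion condition can be checked as in the following Python sketch. The figure is only partially legible in this copy, so the quantification over all pairs of tasks and the symmetric premise for the first task are assumptions made here; the dictionaries stand in for locmp and ptrmp, with a missing or None entry playing the role of "no pointer".

```python
from itertools import combinations

SPECIAL = {"_Loc_cre", "_Loc_del"}


def cre_del_mut_ex(locmp: dict, ptrmp: dict) -> bool:
    """True if no two distinct tasks sit at a special location with the same pointer."""
    busy = [(t, ptrmp[t]) for t in locmp
            if locmp[t] in SPECIAL and ptrmp.get(t) is not None]
    return all(a1 != a2 for (_, a1), (_, a2) in combinations(busy, 2))
```

For example, a task at _Loc_cre and a task at _Loc_del holding the same address violate the condition, whereas the same pair with distinct addresses, or with one task at _Loc_normal, satisfies it.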
The Proof Obligations for cre_del_mut_ex. Since cre_del_mut_ex is speci-
fied as a part of the global invariant, a proof obligation in the verification of the
service functions is to establish cre_del_mut_ex where a critical region is exited. 

For instance, when a task t exits from the critical region in service_obj_delete, 
the task gets to the program location _Loc_del. Hence, it should be ascertained 
that there is no other task at the program location _Loc_cre, and working with 
the kernel object pointed to by the local pointer p in service_obj_delete. Consider 
the point where task t has just completed the assignment p<-obj_arr[idx].ptr
in the aforementioned critical region. There, the kernel object O_kn pointed
to by p is referenced from a service object. From @) in the invariant condi-
tion sobj_kobj_aux, if a task t’ is at the program location _Loc_cre and working
with a kernel object O'_kn, this O'_kn is not referenced from any service object.
Hence, O'_kn must be different from O_kn. Since the other tasks do not execute
while the task t is in a critical region, there is still no task at _Loc_cre and
working with the kernel object O_kn when the task t exits from the critical
region in service_obj_delete. In addition, conditions @) and ©) in the definition
of sobj_kobj_aux are also used to establish the condition cre_del_mut_ex where 
some of the critical regions are exited. We do not expand on the details. 


Summary of Invariant Design. The invariant conditions dependent on aux- 
iliary variables enable the establishment of structural integrity properties about 
the connection from service objects to kernel objects. This provides a solid foun- 
dation for formally verifying the service functions (if they are implemented with 
the expected code patterns) based on a microkernel that is already verified in 
CSL-R. We provide the formalized code, formal specifications, and correctness 
proofs for the functions in Fig. 6 as part of the accompanying artifact. 


6 Experimental Evaluation 


We apply our methodology in the formal verification of a group of inter-task 
synchronization and communication services implemented as an extension to the 
preemptive microkernel μC/OS-II. These services are developed by a separate
group of people for safety-critical usage scenarios (e.g., in aerospace vehicles
and self-driving cars). The services provide functions for manipulating mutexes,
semaphores, and message queues. These service objects extend the corresponding
kernel objects of μC/OS-II. For instance, a service-level mutex can be recursive
or non-recursive, a service-level semaphore can be binary or counting, and a
service-level message queue can be blocking or non-blocking. This fine-grained
distinction of object types is not supported by the corresponding kernel objects
of μC/OS-II. We discuss some key aspects of our formal verification below.


Application of the Methodology. Almost all the interface functions for the 
inter-task synchronization and communication services invoke the underlying 
functions of μC/OS-II to complete operations on kernel objects. This invocation
is usually performed outside of critical regions. For instance, the service function
could be pthread_mutex_lock for obtaining a service-level mutex, and the corre-
sponding kernel function of μC/OS-II would be OSMutexPend. We are able to
compose the specifications of the service functions from the specifications of the
corresponding kernel functions in the extended CSL-R verification framework 
(see Section 4.1). In addition, the service objects are often initialized with point- 
ers to dedicated structs containing attribute values. Our extension to the CSL-R 
framework also enables us to express the assumption that each of these pointers 
points to a well-formed struct of the appropriate type. 

Almost all the service functions are implemented following the code patterns 
in Fig. 6. For each kind of service (for mutexes, semaphores, and message queues), 
we use the method in Section 5 to establish the structural properties about the 
connection between service objects and kernel objects. A complication arises 
because μC/OS-II has a common pool for kernel objects of different kinds. On
the other hand, each kind of service object is represented using a different struct, 
and organized in a separate array. In the verification, we establish that each kind 
of service object in use references a kernel object of the same kind, and each 
kernel object is referenced by at most one service object of the same kind. 


Verification Efforts. The source code for the interface functions and the 
newly implemented internal functions totals 1561 lines. Our proof code for these 
functions totals approximately 49k lines. The statistics about the lines of source 
code and the lines of proof code for our verification of the interface functions 
for the mutex service are given in Table 1. The corresponding statistics for the 
verification of the other two services are omitted for space reasons. The overall 
ratio between the verified code and the verification code is about 1:31. This 
ratio is on par with that in the formal verification of μC/OS-II [28, 27]. Owing to
the compositional specification of the service functions, we did not need to re-
develop the proofs for the microkernel. Hence, we were able to devote more effort
to establishing the structural properties of the connection between the service
level and kernel level, which made the verification of the services possible. It
took approximately 3 person-years to complete the verification. This included
6 person-months for extending the CSL-R framework as well as designing and
stabilizing all the invariants that connect the service level and the kernel level.
Table 1: The statistics about the formal proofs for the mutex service


Service Function          Source LOC  Proof LOC | Service Function                   Source LOC  Proof LOC
pthread_mutex_init                76       1986 | pthread_mutexattr_init                     60       1150
pthread_mutex_destroy             33        605 | pthread_mutexattr_destroy                  21        506
pthread_mutex_lock                99       2514 | pthread_mutexattr_gettype                  36        654
pthread_mutex_trylock             96       2457 | pthread_mutexattr_settype                  38        705
pthread_mutex_timedlock          106       2765 | pthread_mutexattr_getprioceiling           38        726
pthread_mutex_unlock              97       2563 | pthread_mutexattr_setprioceiling           39        732
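
The ratios discussed under Verification Efforts can be re-derived from the reported figures. The short Python calculation below copies the per-function numbers from Table 1; the totals and the mutex-only ratio are computed here and are not stated in the text, and the overall figure of 49k proof lines is approximate.

```python
# (source LOC, proof LOC) per interface function, copied from Table 1
mutex_rows = {
    "pthread_mutex_init": (76, 1986),
    "pthread_mutex_destroy": (33, 605),
    "pthread_mutex_lock": (99, 2514),
    "pthread_mutex_trylock": (96, 2457),
    "pthread_mutex_timedlock": (106, 2765),
    "pthread_mutex_unlock": (97, 2563),
    "pthread_mutexattr_init": (60, 1150),
    "pthread_mutexattr_destroy": (21, 506),
    "pthread_mutexattr_gettype": (36, 654),
    "pthread_mutexattr_settype": (38, 705),
    "pthread_mutexattr_getprioceiling": (38, 726),
    "pthread_mutexattr_setprioceiling": (39, 732),
}

mutex_src = sum(src for src, _ in mutex_rows.values())
mutex_proof = sum(proof for _, proof in mutex_rows.values())
mutex_ratio = mutex_proof / mutex_src      # proof lines per source line

overall_ratio = 49_000 / 1561              # approximate overall figures
```

Under these numbers the mutex interface functions alone account for 739 source lines and 17363 proof lines, and the overall figures give about 31 proof lines per source line, in agreement with the stated ratio of 1:31.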


Problems and Fixes. Through formal verification, we uncovered several prob- 
lems in the code of the inter-task synchronization and communication services. 
This code had been extensively tested before our verification started. The most 
common cause of the uncovered problems is the absence of critical regions large
enough to ensure the uninterruptible execution of code. The problem with
the most complicated cause is the following: if four tasks create and delete service objects
concurrently, service objects that are out of sync with their corresponding ker-
nel objects can be brought into existence. For instance, a service-level mutex
could start to reference a kernel-level message queue, and a binary service-level
semaphore could start to reference a kernel-level semaphore with a value of 10.
We uncovered some of the problems after realizing that the services could not be
shown to preserve some of the conditions in the global invariant, although these
conditions captured the required or intended behaviors of the services.

We reported the uncovered problems to the developers of the OS services. 
They performed three main types of modifications to the code. The first was 
enlarging a critical region. The second was adjusting the order of operations. 
The third was introducing dedicated mechanisms to avoid races over global re- 
sources. An example modification to the code was the following. The initial 
implementation of the service function mq_delete invoked the kernel function 
OSQDel before it set the pointer from a service queue to the underlying kernel 
queue to NULL. This order was later reversed such that it agreed with the code 
pattern of service_obj_delete in Fig. 6d. The reason for this reversal was that
the original order was found to cause the existence of service objects that are 
inconsistent with their underlying kernel objects in a highly concurrent setting. 
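
Why the order of the two steps matters can be seen in a deliberately simplified Python model of one interleaving. This is a hypothetical scenario constructed for illustration, not the one found in the verified code: it uses a single preempting task, assumes that a freed address is reused immediately for a kernel object of another kind, and introduces the names run, pool and service_queue.

```python
def run(unlink_first: bool):
    """Return (pointer held by the service queue, kind of kernel object at 'a')
    as observed right after a preemption between the two deletion steps."""
    pool = {"a": "queue"}            # address -> kind of live kernel object
    service_queue = {"ptr": "a"}     # service-level queue referencing address a

    def free_kernel():               # models freeing the kernel queue at a
        del pool["a"]

    def unlink():                    # models setting the service pointer to NULL
        service_queue["ptr"] = None

    def other_task_creates_mutex():  # preempting task; reuses a freed address
        if "a" not in pool:
            pool["a"] = "mutex"

    first, second = (unlink, free_kernel) if unlink_first else (free_kernel, unlink)
    first()
    other_task_creates_mutex()       # preemption between the two steps
    snapshot = (service_queue["ptr"], pool.get("a"))
    second()
    return snapshot
```

With the original order, the snapshot shows the service queue still pointing at an address that now holds a kernel mutex; with the reversed order, the service object is already unlinked when the preemption occurs.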


7 Related Work 


Our focus is the formal verification of functional correctness for OS services, 
building on the verification results for an underlying OS kernel. However, our 
methodology is also applicable if the service functions are implemented inside the 
kernel. Hence, one type of related work is the formal verification of OS kernels. 

In the literature, there are several developments about the formal verification 
of OS kernels at the implementation level. The seL4 operating system is formally 
verified in terms of functional correctness and information security [21, 20]. In the 
Verisoft project, an operating system kernel encompassing assembly code and 
device drivers is formally verified [5,4]. CertiKOS [18,17] is a formally verified
concurrent OS. It is carefully organized in layers to facilitate verification. The
commercial preemptive microkernel μC/OS-II is formally verified in terms of the
functional correctness of the API functions [28, 27]. In [11], queue data structures 
for inter-process communication are verified using the Iris framework [2]. 

Like our work, the aforementioned developments verify operating system code 
using a proof assistant such as Isabelle [23] or Coq [1]. Unlike our work, these 
developments are not focused on the formal verification of code that builds on 
an OS kernel, by building on prior verification results for the kernel. Our verifi- 
cation is performed for a group of inter-task synchronization and communication 
services. On the other hand, the verification performed in the aforementioned 
related developments either has a comprehensive coverage of the functionalities 
of an OS, or targets a different component than our verification does. 

Apart from the aforementioned related work, several developments (e.g. [25, 
12, 13, 24, 22,6, 7, 29]) formally verify operating systems at a more abstract level 
than we do, or via an approach that is different from ours — such as through 
model checking or requiring trust in external solvers (e.g., Z3 [15]). In addition,
some of the existing works [20, 14,30] verify the security properties of operating 
systems, instead of functional correctness as we verify in the present work. 

Our work is about the formal verification of concurrent programs in a broad 
sense. Notable verification frameworks in this regard include Iris [19] and VST [10]. 
These frameworks have no built-in support for the type of concurrency in a pre-
emptive OS kernel, where the switch between threads is triggered via interrupt
handling. Our use of the auxiliary variables with fractional permission helps ex- 
press a protocol followed by the concurrent tasks that manipulate the service 
objects. In the literature, there exist techniques with dedicated abstractions for 
expressing the protocols followed by concurrent threads. An example abstrac- 
tion is a state transition system [26]. In the present work, our focus is to achieve 
the required verification by maximally exploiting the features of the verification 
framework for the underlying microkernel. Hence, we have not introduced fur- 
ther abstractions for the expression of protocols. Due to space limits, we stop 
here in our discussion about related work in concurrent program verification. 


8 Conclusion 


We address the problems in formally verifying a group of OS services that build 
on a preemptive microkernel, in case the microkernel itself has been formally 
verified. Specifically, the verification of the microkernel has been a refinement 
verification performed using a concurrent separation logic that supports frac- 
tional permission. Our aim is to build sufficiently on the verification framework 
and verification code for the microkernel, in verifying the code of the services. 
Our methodology consists of enhancements to the verification framework that 
enable the compositional specification of the service functions, as well as a de- 
sign of invariants for establishing structural integrity properties about the con- 
nection between the service level and the kernel level. We use the methodology 
to accomplish a substantial verification task targeting a group of inter-task syn- 
chronization and communication services based on the preemptive microkernel 
μC/OS-II. The verification uncovers dormant bugs and provides a level of cor-
rectness assurance that is above what can be achieved through extensive testing.
A potential direction for future work is the design of deductive systems that 
facilitate the verification of global properties for a service, based on the abstract 
programs of all the interface functions of a service. Another direction for future 
work is the verification of progress properties for the functions of a service. 


Data-Availability Statements. The mechanized extension to the CSL-R veri- 
fication framework and proofs for the OS service in abstract form (as described in 
Section 4 and Section 5) are published at Zenodo (10.5281/zenodo.10456998).
Abstract. Attack trees are important for security, as they help to iden-
tify weaknesses and vulnerabilities in a system. Quantitative attack tree
analysis supports a number of security metrics, which formulate important
KPIs such as the shortest, most likely and cheapest attacks.

A key bottleneck in quantitative analysis is that the values are usually 
not known exactly, due to insufficient data and/or lack of knowledge. 
Fuzzy logic is a prominent framework to handle such uncertain values, 
with applications in numerous domains. While several studies proposed 
fuzzy approaches to attack tree analysis, none of them provided a firm 
definition of fuzzy metric values or generic algorithms for computation 
of fuzzy metrics. 

In this work, we define a generic formulation for fuzzy metric values that 
applies to most quantitative metrics. The resulting metric value is a fuzzy 
number obtained by following Zadeh’s extension principle, obtained when 
we equip the basis attack steps, i.e., the leaves of the attack trees, with 
fuzzy numbers. In addition, we prove a modular decomposition theorem 
that yields a bottom-up algorithm to efficiently calculate the top fuzzy 
metric value. 


Keywords: Attack trees · quantitative analysis · fuzzy numbers.


1 Introduction 


Attack trees. Attack trees (ATs) [32] are a popular tool for modeling and an- 
alyzing security risks. They provide a structural way to identify vulnerabilities 
in a system, by decomposing the attacker’s goal into subgoals, down to basic 
attack steps that a malicious actor can take to reach said objective. An attack 
tree consists of basic attack steps (BASs) representing atomic adversary actions, 
and intermediate AND/OR-gates whose activation depends on the activation of
their children. The attacker’s goal is to activate the root (top node), see Fig. 1 
for an example. ATs can be trees or directed acyclic graphs (DAGs). ATs have 
been supported by commercial tools [1-3] and equipped with semantics [25, 18]. 
© The Author(s) 2024

[Figure: attack tree with node labels "enter the bank", "sneak in" (costs: 30),
"break in" (costs: 5), and "open the vault" (costs: 60).]

Fig. 1: The AT model visualises the attack steps by which an attacker can illegally 
take money from a bank. The attacker needs to enter the bank by breaking in 
or sneaking in, and also needs to open a vault. Sneaking in, breaking in, and 
opening a vault cost 30, 5 and 60 minutes, respectively. Hence, the quantitative 
metric minimal cost for the attacks is min(30 + 60,5 + 60) = 65. 


Quantitative analysis. Beyond qualitative analysis, ATs are also used to calculate
important security metrics of the system, e.g., the minimal cost (in money, time
or resources) the attacker needs to spend for a successful attack, or the probability
of a successful attack. Such metrics are obtained by assigning an attribute value
to each BAS, such as the cost needed to perform that BAS, and using this as
input to calculate the security metric. When the AT is tree-shaped, the metric
is quickly calculated using a bottom-up algorithm, propagating values from the
BASs to the top. For DAG-shaped ATs this problem is NP-complete, but good
heuristics exist [22]. These algorithms are formulated in the generic algebraic
structure of semirings, allowing them to be applied to a wide range of security
metrics including cost, time, skill, damage, etc.


Uncertain parameters. The methods described above assume that all BAS
parameters are known exactly. However, this is problematic in practice: statistics
on attacker capabilities may be hard to obtain, and because of the fast-changing
nature of the field, historical data are only of limited use. Obtaining accurate
and realistic parameter values is a key bottleneck in quantitative security
analysis. In their absence, there is a great need for methods that allow us to deal
with uncertain and approximately known parameter values.


Fuzzy theory. Fuzzy theory is a prominent framework in which parameter
uncertainty and its effect on a calculation’s outcome can be expressed mathematically.
It has been successfully used in many applications, including machine learning
[7], reliability engineering [6], and computational linguistics [24]. Rather than
an exact (‘crisp’) value, e.g., x = 3, each parameter is assigned a range of values,
and to each of these a possibility value in [0,1] is assigned by means of a
membership function. Often, only functions of a specific form are considered, leading
to the definition of triangular, trapezoidal, etc. fuzzy numbers [13].

While fuzzy theory has been applied to AT analysis before [17, 35, 19, 11,
36], much of the earlier work lacks mathematical rigor, and none of these works
applies fuzzy theory to quantitative analysis. As a result, there are no algorithms
for calculating AT metrics with fuzzy parameters. In fact, to our knowledge the
fuzzy counterpart of quantitative AT analysis has not been defined yet. A key
technical hurdle is that the operations typically used in AT analysis do not
preserve popular fuzzy number types: for instance, the OR-gate corresponds
to the operation min for the minimal cost metric, and applying min to two
triangular fuzzy numbers does not yield a triangular fuzzy number.


Contributions. Our first contribution is a clear, mathematically rigorous defini- 
tion of fuzzy AT metrics. Because these are defined for general fuzzy numbers, 
rather than specific subtypes such as triangular fuzzy numbers, we sidestep the 
problem that these subtypes are not preserved under AT metric operations; in- 
stead, our definition works for the generic semiring framework defined in [22]. 
We show that our definition naturally follows from Zadeh’s extension principle 
[38], a general approach for extending functions to fuzzy numbers. 

Having defined fuzzy AT metrics, we furthermore develop a linear-time, 
bottom-up algorithm for calculating them for tree-shaped ATs. We show the 
validity of this algorithm by showing that fuzzy AT metrics are susceptible to 
modular analysis: when an AT has a module, i.e., a minimally connected sub- 
component, a fuzzy metric can be computed by first calculating the metric for 
the module and then for its complement. When an AT has many modules, this 
substantially speeds up computation. When an AT is tree-shaped, every node is 
a module, proving the validity of the algorithm. 

Our algorithm generalizes the bottom-up algorithm for crisp AT metrics from
[22]. Unfortunately, the algorithm for DAG-shaped ATs from that paper does
not transfer to the fuzzy setting. The key reason is that fuzzy numbers do not
form a semiring, as we show in this paper. Fuzzy metrics for DAG-shaped ATs
require a radically new approach, and we leave this for future work.

In summary, our contributions are:


1. A rigorous, general definition of fuzzy AT metrics; 
2. A bottom-up algorithm for computing fuzzy metrics in tree-structured ATs; 
3. A proof of modular decomposition for fuzzy AT metrics. 


The full version of this paper (including the appendix) is available on Zenodo [9].


2 Related work 


Below, we provide a literature review for computation of metrics with fuzzy
numbers applied to attack trees and the related formalism of fault trees.


Attack tree analysis with fuzzy numbers. Several fuzzy approaches to attack
trees have been proposed: an intuitionistic fuzzy set was used to represent the
uncertainty and hesitancy present in data [17], attack-defense models were
proposed [35, 11], and a fuzzy analytic hierarchy process was used to establish
a success probability model of cyber attacks [36, 19]. By contrast, fuzzy
attributes in fault tree analysis (FTA) have been studied for many years, as
summarized in [37, 15, 31, 14, 23].


Fault tree analysis. Fault trees can be considered as the safety variant of attack 
trees: whereas attack trees indicate how malicious attacks propagate through a 
system and lead to damage, fault trees indicate how unintended failures prop- 
agate and lead to system level failures. Therefore, leaves of a fault tree model 
component failures and are called basic events (BEs). Due to their similarities, 
many approaches to fuzzy fault tree analysis can also be applied to attack trees. 
Comprehensive literature surveys on fault trees with fuzzy numbers can be found
in [37, 23, 31, 14].


Fault tree analysis with fuzzy probabilities. Fuzzy set theory was first used in
fault tree analysis by Tanaka et al. [34] to address the problem of uncertain
BE failures. In that paper, Zadeh’s extension principle was used to estimate the
possibility of system failure. The failure possibilities of the basic events and
the top event were represented as trapezoidal fuzzy numbers.

Singer [33] considered the distribution of BEs as fuzzy numbers. The mem- 
bership function is continuous and is approximated by left and right functions 
called L-R type fuzzy numbers [10]. Here, L-R type fuzzy numbers are defined by 
a triplet (m,a,b), where m,a,b are positive real numbers. The author extended 
algebraic operations on the triplet of L-R type fuzzy numbers and calculated the 
possibility distribution of the system. 

Kim et al. [16] evaluated the possibility of system failure. Similar to [33], L-R 
type fuzzy numbers are used as the possibilities of BEs. The value m of the triplet 
(m, a,b) is evaluated by four-expert valuations in the form of triangular fuzzy 
numbers (TFNs). Each value m is determined to calculate the optimistic and 
pessimistic possibilities of a system accident. Finally, two cases of possibilities - 
the pessimistic possibility of system failure with major TFN and the optimistic 
one with minor TFN - were determined. 

Lin et al. [21] estimated the failure possibility of ambiguous events. For this
purpose, the linguistic variables describing the evaluation data are expressed in
triangular or trapezoidal fuzzy numbers denoting failure possibilities. The fuzzy
possibility of a top event is calculated using the α-cut fuzzy operators.

Peng et al. [27] presented an approach to fault diagnosis of communication
control systems. All probability values of the fault tree were converted to
uniform triangular fuzzy numbers. The fuzzy probability of the top event was then
calculated using Zadeh’s principle. A fault tree (FT) consisting of only OR-gates
was used as an analytical example to determine the confidence interval
of the probability of the top event and to obtain a fuzzy reasoning diagnosis result.


Fault tree reliability analysis with interval arithmetic. Purba et al. [28]
developed a fuzzy-probability-based fault tree analysis to propagate and quantify
epistemic uncertainty arising in basic events. BE reliability characteristics are
described by fuzzy probabilities. From the BE fuzzy probabilities, the matrix of
fuzzy probabilities of the minimal cut sets is generated, and then the top event
fuzzy probability is quantified using the fuzzy multiplication rule in engineering
applications.

Purba et al. [29] proposed a fuzzy probability and α-cut based FTA approach.
Each fuzzy probability distribution of BEs is represented uniquely by an α-cut.
The top event α-cut is quantified into the best estimate α-cut, the lower bound
α-cut, and the upper bound α-cut, following fuzzy arithmetic operations on α-cuts
of BEs. The approach was verified by evaluating the reliability of a complex
engineering system, and the results were compared to the reliability of the same
system quantified by conventional FTA.


Fuzzy FTA by conversion of fuzzy number of BEs to crisp probability of BEs. Hu
et al. [12] developed an FFTA methodology for analyzing above-ground walled
storage system failures. Expert elicitation and fuzzy logic were used to handle
the ambiguities and vagueness in the linguistic variables of BEs. The fuzzy
probability of each BE was defuzzified to a crisp number. The resulting crisp
probabilities of the BEs were used as inputs to compute the crisp probability of
the top event.

At the time of this writing, fuzzy quantitative analysis has not been studied
for ATs. The literature has introduced fuzzy analysis of FTs, but it only
addresses certain types of fuzzy numbers (trapezoidal, triangular, etc.). This
paper thus provides a general mathematical framework for fuzzy analysis of ATs.


3 Fundamentals of fuzzy theory 


Fuzzy set theory was introduced by L.A. Zadeh [38] to deal with problems in
which vagueness is present. Instead of considering elements $x$ of a set $X$ with
a fixed value, we consider fuzzy elements $\mathsf{x}$ which can have a range of
possible values; the extent to which $\mathsf{x}$ can be equal to $x$ is expressed
by the membership degree of $x$ in $\mathsf{x}$, which is a value
$\mathsf{x}[x] \in [0,1]$. The value $\mathsf{x}[x]$ is the confidence one
has that $\mathsf{x}$ has value $x$. Here $\mathsf{x}[x] = 1$ denotes full
membership, while $\mathsf{x}[x] = 0$ denotes no membership.

For instance, the time needed to perform an attack may be given as a real
number, e.g. $x = 3 \in \mathbb{R}$; but often the exact time needed is not known
precisely, and can be somewhere around 3. This can be represented by a fuzzy number
$\mathsf{x}\colon \mathbb{R} \to [0,1]$ which is 0 everywhere except close to 3,
and which has a maximum at 3 (see Fig. 2).


Definition 1. Let $X$ be a set. A fuzzy element of $X$ is a function
$\mathsf{x}\colon X \to [0,1]$. The set of all fuzzy elements of $X$ is denoted
$F(X) := \{\mathsf{x} \mid \mathsf{x}\colon X \to [0,1]\}$.


In the literature, fuzzy elements are usually called fuzzy sets [38], on the basis
that the membership function $\mathsf{x}\colon X \to [0,1]$ generalizes the
indicator function $1_S\colon X \to \{0,1\}$ of a set $S \subseteq X$; thus a fuzzy
set can be thought of as a set of which elements can have partial membership.
Instead, we use the term fuzzy element to stress that in this paper, fuzzy elements
are used to express the uncertainty of individual values, as in Fig. 2b, rather
than the uncertainty of set membership. A fuzzy element $\mathsf{x}$ behaves
similarly to a probability density function in that the uncertainty of an element
of $X$ is expressed by a function on $X$.


[Figure: two plots, panels (a) and (b), each with the value 3 marked on the horizontal axis.]

Fig. 2: A non-fuzzy, ‘crisp’ element $x$ (a) and a fuzzy element $\mathsf{x}$ (b).

Our definition of fuzzy element is very general. Many works in the literature
restrict the form of the function $\mathsf{x}\colon X \to [0,1]$ to make computation
more convenient, especially for $X = \mathbb{R}$, i.e., for so-called fuzzy numbers.
Thus there exist triangular, trapezoidal, Gaussian, etc. fuzzy numbers [13, 8].


Example 1. Consider real numbers $a < b \le c < d$. The trapezoidal fuzzy number
$\mathrm{trap}_{a,b,c,d} \in F(\mathbb{R})$ is defined as (see Fig. 3):

$$\mathrm{trap}_{a,b,c,d}[x] = \begin{cases}
\frac{x-a}{b-a}, & \text{if } a \le x \le b, \\
1, & \text{if } b \le x \le c, \\
\frac{d-x}{d-c}, & \text{if } c \le x \le d, \\
0, & \text{otherwise.}
\end{cases} \tag{1}$$

The trapezoidal fuzzy number $\mathrm{trap}_{a,b,c,d}$ has the maximal membership
degree of 1, i.e., $\mathrm{trap}_{a,b,c,d}[x] = 1$ for all $x \in [b,c]$. At the
same time, $a$ and $d$ are the lower and upper bounds of its support, respectively.
In case $b = c$, we have a triangular fuzzy number $\mathrm{tri}_{a,b,d}$.
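
As a minimal sketch of equation (1), the membership function can be evaluated directly in Python. The helper names `trap` and `tri` are illustrative and do not come from the paper; the code assumes a < b <= c < d, so that the triangular case b = c is allowed.

```python
def trap(a, b, c, d):
    """Return the membership function of trap_{a,b,c,d} (assumes a < b <= c < d)."""
    if not (a < b <= c < d):
        raise ValueError("expected a < b <= c < d")

    def membership(x):
        if a <= x <= b:
            return (x - a) / (b - a)   # rising edge
        if b <= x <= c:
            return 1.0                 # plateau of full membership
        if c <= x <= d:
            return (d - x) / (d - c)   # falling edge
        return 0.0                     # outside the support

    return membership


def tri(a, b, d):
    """Triangular fuzzy number: the special case b = c of a trapezoid."""
    return trap(a, b, b, d)


mu = trap(1, 2, 3, 5)
print(mu(1.5), mu(2.5), mu(4), mu(6))  # 0.5 1.0 0.5 0.0
```

Representing a fuzzy number as a plain function mirrors Definition 1, where a fuzzy element is any map into [0,1].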


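For notational convenience we occasionally abbreviate $\mathsf{x}$ via a list of
membership values $x \mapsto \mathsf{x}[x]$, omitting $x$ for which
$\mathsf{x}[x] = 0$. For example, $\mathsf{x} = \{1 \mapsto 0.7, 2 \mapsto 0.5\} \in F(\mathbb{Z})$
is defined by

$$\mathsf{x}[x] = \begin{cases}
0.7, & \text{if } x = 1, \\
0.5, & \text{if } x = 2, \\
0, & \text{otherwise.}
\end{cases}$$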


Arithmetic operations on fuzzy elements are performed following Zadeh’s
extension principle [13, 4, 39, 41, 40, 38]. This principle provides a framework to
apply functions and arithmetic operations on sets to their fuzzy elements. Before
giving the full definition, we motivate it by an example.


[Figure: plot of the membership function $\mathrm{trap}_{a,b,c,d}[x]$.]

Fig. 3: The trapezoidal fuzzy number $\mathrm{trap}_{a,b,c,d}$.


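Example 2. Consider $\mathsf{x}, \mathsf{y} \in F(\mathbb{N})$ given by

$$\mathsf{x} = \{2 \mapsto 0.4,\ 3 \mapsto 1\}, \qquad \mathsf{y} = \{5 \mapsto 1,\ 6 \mapsto 0.6\}.$$

We wish to calculate the addition of $\mathsf{x}$ and $\mathsf{y}$, which we write
as $\mathsf{x} \mathbin{\tilde{+}} \mathsf{y}$. This is also an element of
$F(\mathbb{N})$ and so we must specify the confidence
$(\mathsf{x} \mathbin{\tilde{+}} \mathsf{y})[z]$ that the sum takes the value $z$,
for all $z \in \mathbb{N}$. Consider $z = 8$; the sum takes the value 8 in exactly
two cases:

- $\mathsf{x}$ takes the value 2 and $\mathsf{y}$ takes the value 6;
- $\mathsf{x}$ takes the value 3 and $\mathsf{y}$ takes the value 5.

Our confidence that $\mathsf{x}$ takes the value 2 is $\mathsf{x}[2] = 0.4$, and our
confidence that $\mathsf{y}$ takes the value 6 is $\mathsf{y}[6] = 0.6$. Our
confidence that both of these are true, i.e., that the first case holds, is then
$\min\{0.4, 0.6\} = 0.4$. Similarly, our confidence that the second case holds is
$\min\{1, 1\} = 1$. Our confidence $(\mathsf{x} \mathbin{\tilde{+}} \mathsf{y})[8]$
that the sum takes the value 8 is then the confidence that either of the two cases
above holds; this is expressed by the maximum, so

$$(\mathsf{x} \mathbin{\tilde{+}} \mathsf{y})[8] = \max\{0.4, 1\} = 1.$$

Similarly one can calculate $(\mathsf{x} \mathbin{\tilde{+}} \mathsf{y})[z]$ for
other values of $z$, by taking all possible outcomes of the sum and calculating
their confidence. This yields

$$\mathsf{x} \mathbin{\tilde{+}} \mathsf{y} = \{7 \mapsto 0.4,\ 8 \mapsto 1,\ 9 \mapsto 0.6\}.$$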
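
The max-of-min computation in Example 2 can be mechanised for fuzzy elements with finite support. The following is a minimal sketch, not the paper's implementation: it assumes a fuzzy element is stored as a Python dict from values to membership degrees (absent values have degree 0), and the name `zadeh_extend` is hypothetical.

```python
from itertools import product


def zadeh_extend(f):
    """Lift a crisp function f(x1, ..., xn) to finite-support fuzzy elements."""
    def lifted(*fuzzy_args):
        result = {}
        # Enumerate every combination of crisp values drawn from the supports.
        for combo in product(*(fx.items() for fx in fuzzy_args)):
            values = [value for value, _ in combo]
            confidence = min(degree for _, degree in combo)  # all inputs must hold
            y = f(*values)
            # Several combinations may yield the same y: keep the most confident.
            result[y] = max(result.get(y, 0.0), confidence)
        return result
    return lifted


fuzzy_add = zadeh_extend(lambda a, b: a + b)
x = {2: 0.4, 3: 1.0}
y = {5: 1.0, 6: 0.6}
print(fuzzy_add(x, y))  # {7: 0.4, 8: 1.0, 9: 0.6}
```

Because the supports are finite, the maximum is always attained; for infinite supports the enumeration would have to be replaced by a supremum, which this sketch does not attempt.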


The idea behind Example 2 can be applied to general multivariate functions.
The only change that needs to be made is that in general, there may be infinitely
many pairs $(x, y)$ such that $f(x,y) = z$; therefore one needs to take the supremum
over all $\min\{\mathsf{x}[x], \mathsf{y}[y]\}$ rather than the maximum.


Definition 2 (Zadeh’s Extension Principle). Let $f$ be a multiargument
function $f\colon X_1 \times X_2 \times \cdots \times X_n \to Y$. The Zadeh extension
of $f$ is the function $\tilde{f}\colon F(X_1) \times \cdots \times F(X_n) \to F(Y)$
defined as:

$$\tilde{f}(\mathsf{x}_1, \ldots, \mathsf{x}_n)[y] = \begin{cases}
\sup\limits_{\substack{(x_1, \ldots, x_n) \in \prod_i X_i: \\ f(x_1, \ldots, x_n) = y}} \; \min\limits_{i = 1, \ldots, n} \mathsf{x}_i[x_i], & \text{if } f^{-1}(y) \neq \emptyset, \\
0, & \text{if } f^{-1}(y) = \emptyset.
\end{cases}$$

Based on the extension principle, different arithmetic operations on fuzzy
numbers have been defined [5, 34, 4, 20, 27]. As a result of Definition 2, addition
and subtraction operations on fuzzy numbers typically have straightforward
formulations. E.g., for two trapezoidal fuzzy numbers we have


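$$\mathrm{trap}_{a_1,a_2,a_3,a_4} \mathbin{\tilde{+}} \mathrm{trap}_{b_1,b_2,b_3,b_4} = \mathrm{trap}_{a_1+b_1,\,a_2+b_2,\,a_3+b_3,\,a_4+b_4},$$

$$\mathrm{trap}_{a_1,a_2,a_3,a_4} \mathbin{\tilde{-}} \mathrm{trap}_{b_1,b_2,b_3,b_4} = \mathrm{trap}_{a_1-b_4,\,a_2-b_3,\,a_3-b_2,\,a_4-b_1}.$$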
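
The two formulas above act only on the four parameters, so they are easy to sketch in code. The class `Trap` and its `cut` helper below are illustrative names chosen for this sketch, not part of the paper; `cut` returns the interval of values whose membership degree is at least alpha.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trap:
    """Trapezoidal fuzzy number trap_{a,b,c,d}, stored by its four parameters."""
    a: float
    b: float
    c: float
    d: float

    def __add__(self, other):
        return Trap(self.a + other.a, self.b + other.b,
                    self.c + other.c, self.d + other.d)

    def __sub__(self, other):
        # The smallest difference pairs our lower bound with their upper bound.
        return Trap(self.a - other.d, self.b - other.c,
                    self.c - other.b, self.d - other.a)

    def cut(self, alpha):
        """Interval of values with membership degree >= alpha, for 0 < alpha <= 1."""
        return (self.a + alpha * (self.b - self.a),
                self.d - alpha * (self.d - self.c))


s, t = Trap(1, 2, 3, 4), Trap(0, 1, 2, 5)
print(s + t)  # Trap(a=1, b=3, c=5, d=9)
print(s - t)  # Trap(a=-4, b=0, c=2, d=4)
```

As a sanity check, the interval `(s + t).cut(0.5)` equals the endpoint-wise sum of `s.cut(0.5)` and `t.cut(0.5)`, which is what interval arithmetic on the level sets predicts.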


Multiplication and division, however, are nonlinear operations that produce 
fuzzy numbers of different types than the operands; for example, the quotient of 
two trapezoidal fuzzy numbers is itself not trapezoidal. For convenience and to 
simplify the computation, the resulting fuzzy number can be approximated by 
a fuzzy number of the same type. The computation and visualisation of these 
estimations can be found in [5]. 

In Section 5, we will apply the general fuzzy element framework to formulate
fuzzy attack tree metrics. Unfortunately, the operators considered in AT
analysis, such as min, do not preserve triangular, trapezoidal, etc. fuzzy numbers.
We therefore need to work with fuzzy numbers and Zadeh extensions in full
generality as defined above.


4 Attack trees 


In this section, we provide a brief overview of ATs as presented in [22]. Attack 
trees are hierarchical graphical models that illustrate the attack process. The 
trees are usually drawn inverted, with the root node located at the top of the 
tree and branches descending from the root to the lowest levels of the tree — the 
leaves. The root node represents the attacker’s overall objective. The leaves in 
ATs are called Basic Attack Steps (BASs) representing the attacker’s activities. 
Nodes between the leaves and the root node depict transitional states or attacker 
sub-goals. These intermediate steps are equipped with logical gates that indicate
whether an intermediate step succeeds: an AND-gate succeeds if all of its
children succeed, and an OR-gate succeeds if at least one child succeeds.


Definition 3. [22] An attack tree is a tuple $T = (N, E, t)$, where $(N, E)$ is a
rooted directed acyclic graph, and $t$ is a map
$t\colon N \to \{\mathtt{BAS}, \mathtt{OR}, \mathtt{AND}\}$ such that
$t(v) = \mathtt{BAS}$ if and only if $v$ is a leaf, for all $v \in N$.


The root of $T$ is denoted $R_T$, and the set of children of a node $v$ is denoted
$\mathrm{ch}(v) = \{w \in N \mid (v,w) \in E\}$. The set of basic attack steps is
denoted $\mathrm{BAS}_T = \{v \in N \mid t(v) = \mathtt{BAS}\}$.


4.1 Semantics for attack trees 


The semantics of an AT are defined by its successful attacks, i.e., attacks that
activate the top node. Formally, an attack is a subset $A \subseteq \mathrm{BAS}_T$.
For example, in Fig. 1, writing $p$, $q$ and $r$ for breaking in, sneaking in and
opening the vault, respectively, $\{p, r\}$ is an attack, corresponding to stealing
money by breaking in and then opening the vault. An attack’s success is most
conveniently expressed by the structure function, which is defined recursively as
follows:


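Definition 4. [22] Let $T$ be an AT. The structure function
$f_T\colon N \times 2^{\mathrm{BAS}_T} \to \{0,1\}$ of $T$ is defined, for a node
$v \in N$ and an attack $A \subseteq \mathrm{BAS}_T$, by

$$f_T(v, A) = \begin{cases}
1 & \text{if } t(v) = \mathtt{OR} \text{ and } \exists u \in \mathrm{ch}(v) \text{ s.t. } f_T(u, A) = 1, \\
1 & \text{if } t(v) = \mathtt{AND} \text{ and } \forall u \in \mathrm{ch}(v)\colon f_T(u, A) = 1, \\
1 & \text{if } t(v) = \mathtt{BAS} \text{ and } v \in A, \\
0 & \text{otherwise.}
\end{cases} \tag{2}$$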


An attack $A$ is said to reach a node $v$ if $f_T(v, A) = 1$, i.e. it makes $v$
succeed. If no proper subset of $A$ reaches $v$, then $A$ is a minimal attack on
$v$. The set of minimal attacks on $R_T$ is denoted $\llbracket T \rrbracket$. For
example, the AT from Fig. 1 has three successful attacks: $\{r,q\}$, $\{r,p\}$, and
$\{r,q,p\}$. The first two are minimal, so we have
$\llbracket T \rrbracket = \{\{r,q\}, \{r,p\}\}$.
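
The structure function and the set of minimal attacks can be sketched as follows. This is an illustrative brute-force enumeration, exponential in the number of BASs, and not an algorithm from the paper; the dict encoding of the tree and the node names `root` and `enter` are assumptions made for this sketch.

```python
from itertools import combinations

# Hypothetical encoding: node -> (type, children). Fig. 1 as AND(r, OR(q, p)).
TREE = {
    "root": ("AND", ["r", "enter"]),
    "enter": ("OR", ["q", "p"]),
    "r": ("BAS", []),
    "q": ("BAS", []),
    "p": ("BAS", []),
}


def structure(tree, v, attack):
    """Structure function: does the attack (a set of BASs) reach node v?"""
    kind, children = tree[v]
    if kind == "BAS":
        return v in attack
    results = (structure(tree, u, attack) for u in children)
    return any(results) if kind == "OR" else all(results)


def minimal_attacks(tree, root):
    """All successful attacks on root that have no successful proper subset."""
    bas = sorted(v for v, (kind, _) in tree.items() if kind == "BAS")
    successful = [
        frozenset(subset)
        for size in range(len(bas) + 1)
        for subset in combinations(bas, size)
        if structure(tree, root, frozenset(subset))
    ]
    return {a for a in successful if not any(b < a for b in successful)}


print(sorted(sorted(a) for a in minimal_attacks(TREE, "root")))
# [['p', 'r'], ['q', 'r']]
```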

A discussion of attacks and semantics for ATs is presented in [22].
Note that adding BASes to an attack will not make it less successful; hence the
successful attacks are determined by $\llbracket T \rrbracket$. This leads to the
following definition of the semantics.


Definition 5. The semantics of an AT $T$ is its suite of minimal attacks
$\llbracket T \rrbracket$.


4.2 Security metrics for attack trees 


Quantitative AT analysis may concern various attributes, such as cost, time,
damage, etc. To handle all these attributes in a generic way, analysis algorithms
work over a so-called attribute domain $(V, \triangledown, \triangle)$. Here $V$ is
the value domain for the attribute, e.g., $\mathbb{R}_{\ge 0}$ for costs, and
$[0,1]$ for probability. Furthermore, $\triangledown$ and $\triangle$ are binary
operators on $V$, where $\triangledown$ denotes the way values are propagated
over an OR-gate: if $T = \mathtt{OR}(a, b)$ and $a, b$ are BASs assigned metric
values $x_a, x_b$, then $x_a \triangledown x_b$ is the security value of $T$.
Similarly, $\triangle$ is the operator corresponding to the AND-gate. For technical
reasons we assume $\triangledown$ and $\triangle$ satisfy some algebraic
properties, which are encoded in the definition of a semiring.


Definition 6. [22] A semiring is a tuple $(V, \triangledown, \triangle)$ where $V$
is a set, $\triangledown$ and $\triangle$ are commutative associative binary
operators on $V$, and $\triangle$ distributes over $\triangledown$ (i.e.
$x \triangle (y \triangledown z) = (x \triangle y) \triangledown (x \triangle z)$).


To assign a metric value to an AT $T$, one chooses a semiring $V$ in which the
metric takes values, as well as a BAS value $x_a \in V$ for each BAS $a$; this is
encoded as a vector $\vec{x} \in V^{\mathrm{BAS}_T}$. The calculation of the metric
value of $T$ proceeds in two steps: first, we assign values to an attack
$A = \{a_1, \ldots, a_n\}$. Since all BASs have to be executed, we set
$m_A(\vec{x}) = \bigtriangleup_{i=1}^{n} x_{a_i}$. This corresponds to the
cost/damage/probability/etc. of the attack $A$, given the BAS values $\vec{x}$.
Next, we calculate the metric value of $T$ as a whole. To do this, we consider the
set of all minimal attacks $\llbracket T \rrbracket = \{A_1, \ldots, A_m\}$.
Since for the top node to be reached one only needs one minimal attack, the
metric value for $T$ is calculated via
$m_T(\vec{x}) = \bigtriangledown_{i=1}^{m} m_{A_i}(\vec{x})$.

Example 3. We consider the minimal cost metric that assigns to an AT the
minimal cost the attacker needs to spend to successfully reach the top node. This
corresponds to the semiring $(\mathbb{N}, \min, +)$. Indeed, the cost needed to
activate the top node in $\mathtt{OR}(a,b)$ is the minimum of the costs $x_a$ and
$x_b$, as only one of the two children needs to be activated; hence
$\triangledown = \min$. Similarly, an AND-gate needs to activate all children, so
their costs need to be added and $\triangle = +$. Then given a vector
$\vec{x} \in \mathbb{R}_{\ge 0}^{\mathrm{BAS}_T}$ assigning a cost value
$x_a \in \mathbb{R}_{\ge 0}$ to each BAS $a$, the metric value of $T$ is defined as
$m_T(\vec{x}) = \min_{A \in \llbracket T \rrbracket} \sum_{a \in A} x_a$. Here
$\sum_{a \in A} x_a$ is the total cost of performing an attack $A$, so the metric
value corresponds to the cost of the cheapest minimal attack. Consider the AT
$T = \mathtt{AND}(r, \mathtt{OR}(q, p))$ in Fig. 1. Recall that
$\llbracket T \rrbracket = \{\{r,q\},\{r,p\}\} = \{A_1, A_2\}$, and consider an
attribution $\vec{x}$ given by $x_r = 60$, $x_q = 30$, $x_p = 5$. Then the metric
can be calculated as follows.

$$m_T(\vec{x}) = \min\Big(\sum_{a \in A_1} x_a,\ \sum_{a \in A_2} x_a\Big) = \min(60 + 30,\ 60 + 5) = 65.$$

Formalizing the discussion and example above leads to the following defini- 
tion. 


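Definition 7. [22] Let $T$ be an AT and let $(V, \triangledown, \triangle)$ be a
semiring.


1. An attribution of $T$ in $V$ is an element $\vec{x}$ of $V^{\mathrm{BAS}_T}$.


2. Given an attribution $\vec{x}$, the metric value of $T$ given $V$ and $\vec{x}$
is defined as

$$m_T(\vec{x}) = \bigtriangledown_{A \in \llbracket T \rrbracket} \bigtriangleup_{a \in A} x_a \in V. \tag{3}$$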


As is implicit from the notation, we consider a metric to be a function
$m_T\colon V^{\mathrm{BAS}_T} \to V$ that takes as input the vector $\vec{x}$ of BAS
attribute values (e.g. BAS costs), and outputs the AT’s security value (e.g. the
minimal cost needed to successfully attack the AT). This viewpoint is useful when
extending AT metrics to the fuzzy setting in the next section.


5 Fuzzy metrics for attack trees 


To define fuzzy AT metrics (as stated, to the best of our knowledge no such
definition exists yet), we equip each BAS with a fuzzy element of $V$, i.e., an
element of $F(V)$. Thus, a fuzzy attribution is an element $\vec{\mathsf{x}}$ of
$F(V)^{\mathrm{BAS}_T}$, assigning a fuzzy element $\mathsf{x}_a$ to each BAS $a$.
For crisp metrics, the AT’s metric value is obtained by applying a function $m_T$
to the crisp attribution vector $\vec{x}$, as outlined in Definition 7. Analogously,
we obtain the fuzzy metric value by applying $\tilde{m}_T$ to $\vec{\mathsf{x}}$,
where $\tilde{m}_T$ is the Zadeh extension of $m_T$.


Example 4. Consider the AT $T = \mathtt{AND}(r, \mathtt{OR}(q,p))$ from Fig. 1;
recall that $\llbracket T \rrbracket = \{\{r,q\}, \{r,p\}\}$. We consider the
minimal time metric, corresponding to the semiring $(\mathbb{R}_{\ge 0}, \min, +)$.
For this semiring, consider the fuzzy attribution
$\vec{\mathsf{x}} = (\mathsf{x}_r, \mathsf{x}_q, \mathsf{x}_p)$ given by
$\mathsf{x}_r = \{50 \mapsto 1, 60 \mapsto 1\}$, $\mathsf{x}_q = \{0 \mapsto 1\}$,
and $\mathsf{x}_p = \{5 \mapsto 1\}$, respectively; that is, $q$ and $p$ have crisp
time values, and $r$ either takes time 50 or 60, with equal possibility.

Since the minimal attacks are $\{r,q\}$ and $\{r,p\}$, the function
$m_T\colon V^3 \to V$ is given by
$m_T(x_r, x_q, x_p) = \min(x_r + x_q, x_r + x_p)$ for all $x_r, x_q, x_p \in V$.
Then the fuzzy metric value is equal to
$\tilde{m}_T(\mathsf{x}_r, \mathsf{x}_q, \mathsf{x}_p)$. Using the definition of
Zadeh extension from Definition 2, the confidence that this fuzzy metric value is
equal to a $y \in \mathbb{R}_{\ge 0}$ is equal to

$$\tilde{m}_T(\vec{\mathsf{x}})[y] = \sup_{\substack{x_r, x_q, x_p \in \mathbb{R}_{\ge 0}: \\ \min(x_r + x_q,\, x_r + x_p) = y}} \min\big(\mathsf{x}_r[x_r], \mathsf{x}_q[x_q], \mathsf{x}_p[x_p]\big).$$

Since $\mathsf{x}_q[x_q] \neq 0$ only for $x_q = 0$, where $\mathsf{x}_q[x_q] = 1$,
we only need to consider $x_q = 0$, and, for the same reason, we only need to
consider $x_p = 5$. Thus the expression above is equal to

$$\sup_{\substack{x_r \in \mathbb{R}_{\ge 0}: \\ \min(x_r,\, x_r + 5) = y}} \min\big(\mathsf{x}_r[x_r], 1, 1\big) = \begin{cases}
1, & \text{if } y = 50 \text{ or } y = 60, \\
0, & \text{otherwise,}
\end{cases}$$

so $\tilde{m}_T(\vec{\mathsf{x}}) = \{50 \mapsto 1, 60 \mapsto 1\}$.
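
The computation in Example 4 can be reproduced by brute force when every fuzzy BAS value has finite support. The sketch below is illustrative only: it enumerates all crisp attributions, so its running time is exponential in the number of BASs, and the names `crisp_min_time` and `fuzzy_metric` are assumptions of this sketch rather than notation from the paper.

```python
from itertools import product


def crisp_min_time(attacks, x):
    """m_T for the semiring (min, +): value of the cheapest minimal attack."""
    return min(sum(x[a] for a in attack) for attack in attacks)


def fuzzy_metric(attacks, fuzzy_x, crisp_metric):
    """Zadeh extension of crisp_metric for finite-support fuzzy attributions.

    fuzzy_x maps each BAS to a dict {value: membership degree}.
    """
    names = list(fuzzy_x)
    result = {}
    for combo in product(*(fuzzy_x[n].items() for n in names)):
        crisp_x = {n: value for n, (value, _) in zip(names, combo)}
        confidence = min(degree for _, degree in combo)
        y = crisp_metric(attacks, crisp_x)
        result[y] = max(result.get(y, 0.0), confidence)
    return result


attacks = [{"r", "q"}, {"r", "p"}]
fuzzy_x = {"r": {50: 1.0, 60: 1.0}, "q": {0: 1.0}, "p": {5: 1.0}}
print(fuzzy_metric(attacks, fuzzy_x, crisp_min_time))  # {50: 1.0, 60: 1.0}
```

Each crisp attribution fixes a single value for r, which is why only 50 and 60 appear in the result.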
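Formally, fuzzy AT metrics are then defined as follows.


Definition 8. Let $T$ be an AT and let $(V, \triangledown, \triangle)$ be a semiring.


1. A fuzzy attribution is an element $\vec{\mathsf{x}}$ of $F(V)^{\mathrm{BAS}_T}$.


2. Given a fuzzy attribution $\vec{\mathsf{x}}$, the fuzzy metric value of $T$
given $V$ and $\vec{\mathsf{x}}$ is defined as $\tilde{m}_T(\vec{\mathsf{x}})$,
where $\tilde{m}_T\colon F(V)^{\mathrm{BAS}_T} \to F(V)$ is the Zadeh extension of
the function $m_T$ from Definition 7.


More concretely, $\tilde{m}_T(\vec{\mathsf{x}})$ is the fuzzy element of $V$
defined, for $y \in V$, by

$$\tilde{m}_T(\vec{\mathsf{x}})[y]
= \sup_{\substack{\vec{x} \in V^{\mathrm{BAS}_T}: \\ m_T(\vec{x}) = y}} \; \min_{v \in \mathrm{BAS}_T} \mathsf{x}_v[x_v]
= \sup_{\substack{\vec{x} \in V^{\mathrm{BAS}_T}: \\ \bigtriangledown_{A \in \llbracket T \rrbracket} \bigtriangleup_{a \in A} x_a = y}} \; \min_{v \in \mathrm{BAS}_T} \mathsf{x}_v[x_v]. \tag{4}$$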


Our choice of using Zadeh’s extension to extend crisp AT metrics to fuzzy AT
metrics is justified by the fact that Zadeh extension treats the input fuzzy
numbers $\mathsf{x}_1, \ldots, \mathsf{x}_n$ as independent, i.e., it assumes that
there is no nontrivial joint fuzzy distribution on the product space
$\prod_i X_i$ of which the $\mathsf{x}_i$ are the marginal distributions [30]. This
is a standard assumption on BASes (see [26] for a similar viewpoint on fault
trees) which we follow. In theory, one could extend the definition to allow
non-independent BASes with more complicated joint fuzzy distributions. However,
the prevailing viewpoint is that such relations should be explicitly modeled into
the AT itself. For example, if the non-independence is due to a common cause
affecting the joint distribution of multiple BAS attribute values, then this common
cause should be explicitly modeled into the AT framework by replacing the BASes
by sub-ATs with shared nodes [26]. We will follow this philosophy and use the
Zadeh extension as the natural way to define fuzzy AT metrics.

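An alternative way of defining fuzzy AT metrics would be to replace the crisp
operators $\triangledown, \triangle$ in (3) with their fuzzy counterparts
$\tilde{\triangledown}, \tilde{\triangle}$. However, this does not coincide with
our definition, as the following result shows:


Theorem 1. In general,
$$\tilde{m}_T(\vec{\mathsf{x}}) \neq \widetilde{\bigtriangledown}_{A \in \llbracket T \rrbracket} \widetilde{\bigtriangleup}_{a \in A} \mathsf{x}_a. \tag{5}$$
This result is shown by the following example.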


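Example 5. We continue Example 4, where
$\tilde{m}_T(\mathsf{x}_r, \mathsf{x}_q, \mathsf{x}_p) = \{50 \mapsto 1, 60 \mapsto 1\}$.
On the other hand,

$$\widetilde{\bigtriangledown}_{A \in \llbracket T \rrbracket} \widetilde{\bigtriangleup}_{v \in A} \mathsf{x}_v = \widetilde{\min}\big(\mathsf{x}_r \mathbin{\tilde{+}} \mathsf{x}_q,\ \mathsf{x}_r \mathbin{\tilde{+}} \mathsf{x}_p\big).$$

One could calculate this fuzzy number in a manner analogous to Example
4, but here we show another method that is often more convenient. For a fuzzy
number $\mathsf{x} \in F(\mathbb{R}_{\ge 0})$, define
$\mathsf{x}^{(1)} = \{x \in \mathbb{R}_{\ge 0} \mid \mathsf{x}[x] = 1\}$; this is
the level 1 α-cut of $\mathsf{x}$ [13]. Then from Definition 2 one can deduce that
for $\mathsf{x}, \mathsf{y} \in F(\mathbb{R}_{\ge 0})$ and
$f\colon \mathbb{R}_{\ge 0}^2 \to \mathbb{R}_{\ge 0}$ one has

$$\big(\tilde{f}(\mathsf{x}, \mathsf{y})\big)^{(1)} = \{f(x,y) \mid x \in \mathsf{x}^{(1)},\ y \in \mathsf{y}^{(1)}\}.$$

For brevity we abbreviate the right hand side of this equation to
$f(\mathsf{x}^{(1)}, \mathsf{y}^{(1)})$. It follows that

$$\begin{aligned}
\big(\widetilde{\min}(\mathsf{x}_r \mathbin{\tilde{+}} \mathsf{x}_q,\ \mathsf{x}_r \mathbin{\tilde{+}} \mathsf{x}_p)\big)^{(1)}
&= \min\big((\mathsf{x}_r \mathbin{\tilde{+}} \mathsf{x}_q)^{(1)},\ (\mathsf{x}_r \mathbin{\tilde{+}} \mathsf{x}_p)^{(1)}\big) \\
&= \min\big(\mathsf{x}_r^{(1)} + \mathsf{x}_q^{(1)},\ \mathsf{x}_r^{(1)} + \mathsf{x}_p^{(1)}\big) \\
&= \min(\{50, 60\} + \{0\},\ \{50, 60\} + \{5\}) \\
&= \min(\{50, 60\},\ \{55, 65\}) \\
&= \{50, 55, 60\}.
\end{aligned}$$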


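Hence $\big(\widetilde{\bigtriangledown}_{A \in \llbracket T \rrbracket} \widetilde{\bigtriangleup}_{v \in A} \mathsf{x}_v\big)[x] = 1$
if and only if $x \in \{50, 55, 60\}$. Since this fuzzy number only takes
possibility values 0 and 1, it follows that

$$\widetilde{\bigtriangledown}_{A \in \llbracket T \rrbracket} \widetilde{\bigtriangleup}_{v \in A} \mathsf{x}_v = \{50 \mapsto 1, 55 \mapsto 1, 60 \mapsto 1\} \neq \{50 \mapsto 1, 60 \mapsto 1\} = \tilde{m}_T(\mathsf{x}_r, \mathsf{x}_q, \mathsf{x}_p).$$


[Figure: three panels showing $\mathrm{tri}_{0,1,4}$, $\mathrm{tri}_{1,2,3}$, and $\widetilde{\min}(\mathrm{tri}_{0,1,4}, \mathrm{tri}_{1,2,3})$.]

Fig. 4: Two triangular fuzzy numbers and their minimum, as a Zadeh extension
of the function min.


The ‘extra’ possibility $55 \mapsto 1$ on the LHS comes from comparing the attack
$\{r,q\}$ with cost 60 + 0 to the attack $\{r,p\}$ with cost 50 + 5. In other words,
in this comparison $r$ is considered to have costs 50 and 60 simultaneously. By
contrast, in the calculation of $\tilde{m}_T(\vec{\mathsf{x}})$ the cost $x_r$ can
only have one value at a time.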
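
The difference between the two sides of (5) can be checked numerically for the attribution of Example 4. The sketch below again assumes finite-support fuzzy elements stored as dicts; `zadeh_extend` is a hypothetical helper, not an API from the paper.

```python
from itertools import product


def zadeh_extend(f):
    """Lift a crisp function to finite-support fuzzy elements (dicts)."""
    def lifted(*fuzzy_args):
        result = {}
        for combo in product(*(fx.items() for fx in fuzzy_args)):
            y = f(*(value for value, _ in combo))
            confidence = min(degree for _, degree in combo)
            result[y] = max(result.get(y, 0.0), confidence)
        return result
    return lifted


x_r, x_q, x_p = {50: 1.0, 60: 1.0}, {0: 1.0}, {5: 1.0}

# Zadeh extension of the whole metric m_T: r has one value per evaluation.
whole = zadeh_extend(lambda r, q, p: min(r + q, r + p))(x_r, x_q, x_p)

# Composition of the separately extended operators: x_r is used twice, and
# the two occurrences are treated as if they were independent.
fuzzy_add = zadeh_extend(lambda a, b: a + b)
fuzzy_min = zadeh_extend(min)
composed = fuzzy_min(fuzzy_add(x_r, x_q), fuzzy_add(x_r, x_p))

print(whole)     # {50: 1.0, 60: 1.0}
print(composed)  # {50: 1.0, 55: 1.0, 60: 1.0}
```

The extra value 55 in `composed` arises from pairing r = 60 in the first summand with r = 50 in the second, which is exactly the effect described above.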


Equation (5) shows that a priori, there are two ways one can define fuzzy AT
metrics. We choose to use the definition of $\tilde{m}_T(\vec{\mathsf{x}})$ via
Zadeh’s extension as in Definition 8 for two reasons: first, this accurately
captures the independence of the BASes as outlined below Definition 8. Second, we
show in Theorem 3 that this definition satisfies modular decomposition, a
fundamental property of AT metrics. The RHS of (5) does not satisfy modular
decomposition, giving another argument why Definition 8 is the preferred definition
(see Remark 2 below).


Example 6. Consider the AT $T = \mathtt{OR}(a, b)$ with the min cost metric,
represented by the semiring $(\mathbb{R}_{\ge 0}, \min, +)$. As fuzzy attributions
consider $\mathsf{x}_a = \mathrm{tri}_{0,1,4}$ and $\mathsf{x}_b = \mathrm{tri}_{1,2,3}$.
Then one can show (see Fig. 4) that
$\tilde{m}_T(\vec{\mathsf{x}}) = \widetilde{\min}(\mathsf{x}_a, \mathsf{x}_b)$ is
given by

$$\widetilde{\min}(\mathsf{x}_a, \mathsf{x}_b)[x] = \begin{cases}
x, & \text{if } 0 \le x \le 1, \\
\frac{4-x}{3}, & \text{if } 1 \le x \le 2.5, \\
3 - x, & \text{if } 2.5 \le x \le 3, \\
0, & \text{otherwise.}
\end{cases}$$

In particular $\widetilde{\min}(\mathsf{x}_a, \mathsf{x}_b)$ is not a triangular
fuzzy number. Hence triangular fuzzy numbers are not preserved by the operations
inherent to AT analysis. The same holds for other popular subtypes of fuzzy numbers
such as rectangular numbers; for this reason, we define fuzzy quantitative AT
analysis for general fuzzy numbers in Definition 8. Finding subtypes of fuzzy
numbers that are preserved by AT analysis operations forms an interesting avenue
for future research.
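
The piecewise formula of Example 6 can be cross-checked numerically. The sketch below relies on an observation that is not stated in the paper: min(x, y) = z holds exactly when one argument equals z and the other is at least z, and for a membership function that rises to a single peak and then falls, the supremum over [z, ∞) is 1 if z lies at or left of the peak and the membership degree at z otherwise. The function names are illustrative.

```python
def tri(a, b, c):
    """Membership function of the triangular fuzzy number tri_{a,b,c}."""
    def mu(x):
        if a <= x <= b:
            return (x - a) / (b - a)
        if b <= x <= c:
            return (c - x) / (c - b)
        return 0.0
    return mu


def fuzzy_min(mu_a, peak_a, mu_b, peak_b):
    """Zadeh extension of min for two single-peaked fuzzy numbers."""
    def sup_from(mu, peak, z):
        # Supremum of mu over [z, infinity) for a single-peaked mu.
        return 1.0 if z <= peak else mu(z)

    def mu(z):
        return max(min(mu_a(z), sup_from(mu_b, peak_b, z)),
                   min(mu_b(z), sup_from(mu_a, peak_a, z)))
    return mu


def closed_form(x):
    """Piecewise formula from Example 6."""
    if 0 <= x <= 1:
        return x
    if 1 <= x <= 2.5:
        return (4 - x) / 3
    if 2.5 <= x <= 3:
        return 3 - x
    return 0.0


m = fuzzy_min(tri(0, 1, 4), 1, tri(1, 2, 3), 2)
grid = [i / 100 for i in range(-50, 451)]
print(all(abs(m(x) - closed_form(x)) < 1e-12 for x in grid))  # True
```

The kink at x = 2.5, where the slope changes from -1/3 to -1, is what prevents the result from being triangular.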


Remark 1. Besides AT metrics as defined in this paper, in [22] quantitative
analysis for so-called dynamic ATs (DATs) is also defined. DATs include a new gate
type SAND (“sequential AND”) used when attack steps have to be performed
in sequential order; the normal AND-gate allows its children to be performed
in parallel. This changes both semantics and quantitative analysis: an attack is
now a partially ordered set $(A, \prec)$ rather than just a set $A$ of BASes, to
denote the relative timing behaviour of the attack steps; and for quantitative
analysis a third binary operation $\triangleright$ is introduced to correspond to
SAND-gates, and the metric is defined in terms of these operators.

The results of this paper straightforwardly carry over to the DAT setting. 
That is, fuzzy DAT metrics are defined as the Zadeh extension of crisp DAT 
metrics akin to Definition 8. Furthermore, this definition satisfies modular de- 
composition, which follows from the modular decomposition of crisp DAT metrics 
analogous to Theorem 3. As a result, a bottom-up algorithm analogous to Alg. 1 
calculates fuzzy DAT metrics for treelike DATs. 


6 Metric computation for ATs 


To calculate the fuzzy AT metric $\tilde{m}_T(\vec{\mathsf{x}})$ directly from
Definition 8, one first needs to calculate the function $m_T$, which in turn
requires one to find $\llbracket T \rrbracket$. In general, this set is of
exponential size, making calculation cumbersome for large ATs. Therefore, dedicated
algorithms for quantitative AT analysis are needed. For crisp AT metrics these are
described in [22]. In this section, we define a bottom-up algorithm for calculating
fuzzy AT metrics for tree-shaped ATs, and we show that its validity follows from
the fact that fuzzy AT metrics satisfy modular decomposition. We also show that the
BDD-based approach for metric calculation for DAG-shaped ATs from [22] does not
extend to the fuzzy case, and that a radically new approach is needed.


6.1 Bottom-up algorithm 


The bottom-up algorithm presented in Algorithm 1 is adapted from the bottom- 
up algorithm for crisp AT metrics first presented in [25]. It takes as input an 
AT T, a node v of T, a semiring D = (V, ▽, △), and a fuzzy attribution x̃, and 
outputs a fuzzy value BU(T, v, D, x̃) ∈ F(V) assigned to v; this value corresponds 
to the metric value associated to reaching v. If t(v) = BAS, this is simply x̃_v. If 
t(v) = OR, then BU(T, v, D, x̃) is obtained by applying the Zadeh extension of ▽ 
to the values associated to the children of v; for t(v) = AND we instead use △. 
The AT’s fuzzy metric value is then given by BU(T, R_T, D, x̃). 


Theorem 2. Let T be a static AT with tree structure, D = (V, ▽, △) a semiring, 
and x̃ a fuzzy attribution with values in V. Then m̃_T(x̃) = BU(T, R_T, D, x̃). 


Example 7. We apply the algorithm to Example 4. Then the algorithm calculates 
the metric as follows 


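\begin{align*}
\mathrm{BU}(T, R_T, D, \tilde{x})[y]
  &= \bigl(\mathrm{BU}(T, r, D, \tilde{x}) \mathbin{\triangle} \mathrm{BU}(T, \min(q, p), D, \tilde{x})\bigr)[y] \\
  &= \bigl(\mathrm{BU}(T, r, D, \tilde{x}) \mathbin{\triangle} (\mathrm{BU}(T, q, D, \tilde{x}) \mathbin{\triangledown} \mathrm{BU}(T, p, D, \tilde{x}))\bigr)[y] \\
  &= \sup_{\substack{x_r, x_{q \vee p} \in \mathbb{R}_{\geq 0}: \\ x_r + x_{q \vee p} = y}}
     \min\Bigl(\tilde{x}_r[x_r],
     \sup_{\substack{x_q, x_p \in \mathbb{R}_{\geq 0}: \\ \min(x_q, x_p) = x_{q \vee p}}}
     \min(\tilde{x}_q[x_q], \tilde{x}_p[x_p])\Bigr) \\
  &= \sup_{\substack{x_r, x_q, x_p \in \mathbb{R}_{\geq 0}: \\ x_r + \min(x_q, x_p) = y}}
     \min(\tilde{x}_r[x_r], \tilde{x}_q[x_q], \tilde{x}_p[x_p]) \\
  &= \sup_{\substack{x_r \in \mathbb{R}_{\geq 0}: \\ x_r + \min(0, 5) = y}}
     \min(\tilde{x}_r[x_r], 1, 1) \\
  &= \begin{cases} 1, & \text{if } y = 50 \text{ or } y = 60, \\ 0, & \text{otherwise,} \end{cases}
\end{align*}

that is, BU(T, R_T, D, x̃) = {50 ↦ 1, 60 ↦ 1}. 


Input: attack tree T = (N, E, t), 
       node v ∈ N, 
       semiring attribute domain D = (V, ▽, △), 
       fuzzy attribution x̃ ∈ F(V)^{BAS_T}. 
Output: fuzzy element BU(T, v, D, x̃) ∈ F(V). 
if t(v) = OR then 
    return ▽_{w ∈ ch(v)} BU(T, w, D, x̃) 
else if t(v) = AND then 
    return △_{w ∈ ch(v)} BU(T, w, D, x̃) 
else /* t(v) = BAS */ 
    return x̃_v 
end 


Algorithm 1: BU for tree-structured AT T. 
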


The algorithm is efficient: it is linear in |E|, making it vastly more 
efficient than first calculating m_T and then Zadeh-extending it. The 
algorithm is generic, as it is applicable to popular quantitative metrics in ATs 
such as cost, damage, skill, probability, etc. [22]. We should note, however, that 
the linearity of the time complexity assumes that the fuzzy operations ▽ and △ 
take constant time. 
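
The bottom-up recursion can be sketched in a few lines of Python. This is only an illustrative sketch under several assumptions: fuzzy elements are represented as dictionaries mapping a value to its membership degree, the helper names `zadeh` and `bottom_up` are hypothetical, and the tree and attribute values below are chosen to mirror the outcome of Example 7 rather than taken from Example 4.

```python
from itertools import product


def zadeh(op, a, b):
    """Zadeh extension of a binary crisp operator to two fuzzy elements.

    A fuzzy element is a dict {value: membership degree}; values that are
    absent have degree 0.
    """
    out = {}
    for (x, mx), (y, my) in product(a.items(), b.items()):
        z = op(x, y)
        out[z] = max(out.get(z, 0.0), min(mx, my))
    return out


def bottom_up(tree, v, join, meet, attribution):
    """BU for a tree-shaped AT; `tree` maps a node to (type, children)."""
    node_type, children = tree[v]
    if node_type == "BAS":
        return attribution[v]
    op = join if node_type == "OR" else meet
    values = [bottom_up(tree, w, join, meet, attribution) for w in children]
    result = values[0]
    for value in values[1:]:
        result = zadeh(op, result, value)
    return result


# Hypothetical tree: root = AND(r, OR(q, p)), min cost semiring (min, +).
tree = {
    "root": ("AND", ["r", "q_or_p"]),
    "q_or_p": ("OR", ["q", "p"]),
    "r": ("BAS", []),
    "q": ("BAS", []),
    "p": ("BAS", []),
}
# Assumed fuzzy attribution, chosen for illustration only.
attribution = {"r": {50: 1.0, 60: 1.0}, "q": {0: 1.0}, "p": {5: 1.0}}

fuzzy_cost = bottom_up(tree, "root", min, lambda a, b: a + b, attribution)
print(fuzzy_cost)
```

Each node is visited once, so the number of calls to `zadeh` is linear in |E|; the cost of each call depends on the support sizes of the fuzzy elements, which is where the constant-time assumption above enters.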

While the algorithm applies only to tree-structured ATs, this covers a large 
portion of the ATs found in the literature [25]. As such, the algorithm can be 
used in many applications. 

As we show in the appendix of [9], the proof of Theorem 2 depends on a 
fundamental property of AT metrics called modular decomposition. In the next 
section, we will explain this and show that fuzzy metrics satisfy this property. 


6.2 Modular decomposition 


Modular decomposition is a fundamental property of AT metrics as it facilitates 
the recursive solution of many problems, which typically improves performance. 
For a node v in an AT T, let T_v be the AT consisting of all descendants of 
v, i.e., the nodes w for which there exists a path v → w. This is a rooted DAG 
with root v. A module is a node v for which T_v is only minimally connected to 
the rest of T: 


Definition 9. Let v ∈ N \ BAS. We call node v a module if v is the only node 
in T_v with connections to T \ T_v. 


For instance, in Fig. 1, the modules are “enter the bank” and “get money”. 
Finding the modules of an AT aids in calculating metrics as follows. Given a 
module v, one can split up T into two parts: the sub-AT T_v with root v, and 
the ‘quotient’ T^v obtained by replacing the entire sub-AT T_v with a single new 
node, which we will still call v (see Fig. 5). Then one can calculate the metric for 
T_v to find m_{T_v}(x_v), and use this as a BAS attribute value for v in T^v. One then 
calculates the metric value for T^v with this new BAS value. In [22, Thm. 9.2] 
it is shown that for crisp metrics this results in the same metric value for T 
as when one considers the entirety of T at once. As a result, we can split up 
metric calculations via a divide-and-conquer approach once one has identified 
the modules. The following theorem shows that this also holds for fuzzy AT 
metrics. 


Theorem 3. Let (V, ▽, △) be a semiring. Let v be a module in an AT T, and let 
x̃ ∈ F(V)^{BAS_T} be a fuzzy attribution for T. Let x̃_v ∈ F(V)^{BAS_{T_v}} be the fuzzy 
attribution for T_v obtained from restricting x̃, i.e., (x̃_v)_w = x̃_w for all 
w ∈ BAS_{T_v}. Let T^v be the AT obtained by replacing T_v in T by a single BAS 
still called v. Let x̃^v ∈ F(V)^{BAS_{T^v}} be a fuzzy attribution for T^v given by 

\[
\tilde{x}^v_w =
\begin{cases}
\tilde{m}_{T_v}(\tilde{x}_v), & \text{if } w = v, \\
\tilde{x}_w, & \text{otherwise.}
\end{cases}
\]

Then m̃_T(x̃) = m̃_{T^v}(x̃^v). 


The theorem is the extension of Theorem 9.2 of [22]. The proof of Theorem 3 
is shown in the appendix of [9]. In a treelike AT, every node is a module, and 
applying modular decomposition then yields Theorem 2. 


Remark 2. In the same way that Theorem 3 can be used to prove Theorem 2, it 
can also be used to show that the alternative definition of fuzzy AT metrics in the 
RHS of (5) does not satisfy modular decomposition. Namely, if the alternative 
definition would satisfy modular decomposition, Alg. 1 would also calculate the 
alternative definition for treelike ATs. However, since this does not conform to 
our Definition 8 even for treelike ATs (see Theorem 1), we conclude that the 
alternative definition does not satisfy modular decomposition. 
Fig. 5: Calculation of m̃_T(x̃) can be done by computing m̃_{T^v}(x̃^v), where 
v ∈ BAS_{T^v} is assigned the fuzzy attribute m̃_{T_v}(x̃_v). 


Fig. 6: A DAG AT (a), and its BDD (b). 


6.3 Computations for DAG ATs 


Directed acyclic graph (DAG) ATs are ATs in which some node has more than 
one parent [22]. Fig. 6a visualizes an AT with DAG structure. Unfortunately, 
Alg. 1 does not correctly compute the (fuzzy) metric value of DAG-shaped ATs. 
The reason for this is that the algorithm does not detect whether a node’s child 
is shared with another node or not, which leads to double counting of a child’s 
metric value. 


Example 8. Let x̃_u = {1 ↦ 1}, x̃_v = {0 ↦ 1, 3 ↦ 1}, x̃_w = {1 ↦ 1}, and 
D = (N, min, +). The min cost computation for the DAG AT shown in Fig. 6a 
using Algorithm 1 gives BU(T, R_T, D, x̃) = min(x̃_u, x̃_v) + min(x̃_v, x̃_w) = 
{0 ↦ 1, 1 ↦ 1} + {0 ↦ 1, 1 ↦ 1} = {0 ↦ 1, 1 ↦ 1, 2 ↦ 1}, whereas 
m̃_T(x̃_u, x̃_v, x̃_w) = {0 ↦ 1, 2 ↦ 1}. 
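
The double counting in Example 8 can be reproduced with a short Python sketch. It assumes, based on the expression min(x_u, x_v) + min(x_v, x_w), that the AT of Fig. 6a is an AND over the two OR-gates OR(u, v) and OR(v, w), so that its minimal attacks are {v} and {u, w}; the helper `zadeh` and the dictionary representation of fuzzy elements are illustrative choices, not part of the paper.

```python
from itertools import product


def zadeh(op, *args):
    """Zadeh extension of an n-ary crisp function to fuzzy elements.

    Each fuzzy element is a dict {value: membership degree}.
    """
    out = {}
    for combo in product(*(a.items() for a in args)):
        z = op(*(value for value, _ in combo))
        degree = min(d for _, d in combo)
        out[z] = max(out.get(z, 0), degree)
    return out


x_u, x_v, x_w = {1: 1}, {0: 1, 3: 1}, {1: 1}

# Bottom-up on the DAG: x_v enters twice and is treated as two
# independent fuzzy elements.
left = zadeh(min, x_u, x_v)
right = zadeh(min, x_v, x_w)
bottom_up = zadeh(lambda a, b: a + b, left, right)

# Definition: Zadeh-extend the crisp min cost metric as a whole, so the
# cost of v is chosen once. Under the assumed structure the minimal
# attacks are {v} and {u, w}.
exact = zadeh(lambda u, v, w: min(v, u + w), x_u, x_v, x_w)

print(bottom_up)  # contains the spurious value 1
print(exact)
```

The spurious value 1 in the bottom-up result comes from combining the choice v = 0 in one branch with v = 3 in the other, a combination that no single crisp attribution can realise.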


For crisp metrics, this was solved by the BDD-based approach introduced 
in [22]. Boolean functions are compactly represented by a binary decision 
diagram (BDD), a type of directed acyclic graph. One can apply this to the structure 
function of an AT as in Fig. 6b: as one can see, each nonleaf is labeled with a 
BAS and has two outgoing edges, while the leaves are labeled 0 and 1. For a 
given attack A, the BDD evaluates f_T(R_T, A) as follows: at a node with label 
v, follow the dashed line if v ∉ A, and the nondashed line if v ∈ A. The leaf 
in which one ends up holds the value of f_T(R_T, A). Every Boolean function can 
be represented as a BDD, and although the corresponding BDD is worst-case of 
exponential size, BDDs are usually quite compact. 

The BDD can also be used to calculate (crisp) AT metrics. We showcase this 
for the minimal cost metric, but it can be applied to other metrics, so long as 
the corresponding semiring is absorbing (see [22]). Minimal cost is calculated 
as follows: for each BAS v, the cost x_v is attached to the nondashed edges 
originating from BDD nodes with label v, while each dashed edge gets label 0 
(see Fig. 6b). Then the attack with minimal cost corresponds to the shortest path 
from R_T to 1 in the BDD; since the BDD is acyclic this computation is linear 
in the size of the BDD. In total, this means that this is worst-case exponential 
in the size of the AT, but in practice the calculation is quite fast. 
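
The shortest-path computation on a BDD can be sketched as follows. The BDD below is a hypothetical one for the Boolean function (u ∨ v) ∧ (v ∨ w) with variable order v, u, w; it is not claimed to be the BDD drawn in Fig. 6b, and the node names and the function `min_cost` are illustrative.

```python
from functools import lru_cache

# Each inner node maps to (BAS label, low child, high child); the low
# (dashed) edge means the BAS is not in the attack, the high (nondashed)
# edge means it is. The leaves are the integers 0 and 1.
bdd = {
    "n_v": ("v", "n_u", 1),
    "n_u": ("u", 0, "n_w"),
    "n_w": ("w", 0, 1),
}
cost = {"u": 1, "v": 3, "w": 1}


def min_cost(bdd, root, cost):
    """Length of the shortest path from `root` to leaf 1.

    High edges carry the cost of their BAS, low edges carry 0.
    Memoisation ensures every BDD node is evaluated once.
    """
    @lru_cache(maxsize=None)
    def shortest(node):
        if node == 1:
            return 0
        if node == 0:
            return float("inf")
        label, low, high = bdd[node]
        return min(shortest(low), cost[label] + shortest(high))

    return shortest(root)


print(min_cost(bdd, "n_v", cost))
```

With the crisp costs above, the cheapest attack is {u, w}; lowering the cost of v to 0 makes {v} the cheapest attack instead.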

Unfortunately, this approach no longer works for fuzzy AT metrics. The rea- 
son is that this approach assumes that the metric arises from a semiring, in 
particular, that distributivity holds. As the following example shows, even if (V, ▽, △) 
is a semiring, (F(V), ▽, △) is in general no longer a semiring, because distributivity no 
longer holds. It is therefore no surprise that the BDD method no longer works 
either. 


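Example 9. Let (V, ▽, △) = (R≥0, min, +), and consider the fuzzy elements x̃ = 
{0 ↦ 1, 2 ↦ 1} and ỹ = z̃ = {0 ↦ 1}. Then using the methods from Example 5, 
we find that 

\begin{align*}
\min(\tilde{x} + \tilde{y}, \tilde{x} + \tilde{z})
  &= \min(\{0 \mapsto 1, 2 \mapsto 1\}, \{0 \mapsto 1, 2 \mapsto 1\}) \\
  &= \{0 \mapsto 1, 1 \mapsto 1, 2 \mapsto 1\}, \\
\tilde{x} + \min(\tilde{y}, \tilde{z})
  &= \{0 \mapsto 1, 2 \mapsto 1\} + \{0 \mapsto 1\} \\
  &= \{0 \mapsto 1, 2 \mapsto 1\}.
\end{align*}

Hence (F(R≥0), min, +) is not distributive, and in particular not a semiring. 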


The reason that distributivity fails for fuzzy numbers is that, as we discussed 
in Section 5, a Zadeh-extended operator like + acts as though its two arguments 
are independent. However, in an expression like min(x+y,x+z) the arguments 
x+y and x+z are typically not independent, since both depend on x. As a result, 
distributivity is not retained under Zadeh extension. 

Since the BDD method used for crisp AT metrics does not work, a new 
method is needed for calculating fuzzy metrics for DAG-like ATs. This is beyond 
the scope of this paper. One possible way to approach this problem is to find 
a way to keep track of the ‘double counting’ that occurs when applying BU to 
DAG-like ATs, and eliminate it at the end of the algorithm. Such an approach 
would require a radically new strategy, and we therefore leave it to future work. 


7 Conclusion and future work 


In this paper we define a mathematical formulation for deriving fuzzy AT metric 
values. To our knowledge, fuzzy theory has been applied in FTs for imprecise 
data, but fuzzy quantitative metrics remain somewhat implicitly defined. The 
definition we provide is explicit and generic for commonly used quantitative 
metrics. Moreover, this definition can be used to better capture uncertainty in 
quantitative metrics values. In addition, this paper introduces an efficient algo- 
rithm to calculate AT metrics with fuzzy attribution. The proposed algorithm 
is linear in |E|, as opposed to the definition of fuzzy metrics which requires cal- 
culation of crisp metrics followed by fuzzy operators. The algorithm works for 
tree-like structure models that satisfy modular decomposition. 

In the future, we want to develop an algorithm for fuzzy metrics computa- 
tion on DAG ATs. For that aim, the algorithm should address the non-semiring 
property of fuzzy operators and the DAG structure on ATs. Another avenue for 
future research is the development of subtypes of fuzzy numbers that are pre- 
served by (Zadeh-extended) arithmetic operations inherent to AT analysis, such 
as min and max. Upon formally defining such subtypes, these can then be used 
to implement quantitative analysis algorithms efficiently. 
Abstract. In recent years, more people have seen their work depend on 
data manipulation tasks. However, many of these users do not have the 
background in programming required to write complex programs, par- 
ticularly SQL queries. One way of helping these users is automatically 
synthesizing the SQL query given a small set of examples. Several pro- 
gram synthesizers for SQL have been recently proposed, but they do not 
leverage multicore architectures. 

This paper proposes CUBES, a parallel program synthesizer for the do- 
main of SQL queries using input-output examples. Since input-output 
examples are an under-specification of the desired SQL query, sometimes, 
the synthesized query does not match the user’s intent. CUBES incorpo- 
rates a new disambiguation procedure based on fuzzing techniques that 
interacts with the user and increases the confidence that the returned 
query matches the user intent. We perform an extensive evaluation on 
around 4000 SQL queries from different domains. Experimental results 
show that our parallel approach can scale up to 16 processes with super- 
linear speedups for many hard instances, and that our disambiguation 
approach is critical to achieving an accuracy of around 60%, significantly 
larger than other SQL synthesizers. 


1 Introduction 


In the age of digital transformation, many people are being reassigned to tasks 
that require familiarity with programming or database usage. However, many 
users lack the technical skills to build queries in a language such as Structured 
Query Language (SQL). Hence, several new systems have been proposed for au- 
tomatically generating SQL queries for relational databases [32,20,30,33]. The 
goal of query synthesis is to automatically generate an SQL query that corre- 
sponds to the user’s intent. For instance, the user can specify their intent using 
natural language [30,33] or examples [28,32,20,27]. Our work targets query syn- 
thesis using examples, where an example consists of a database and an output 
table that results from querying the database. The problem of synthesizing SQL 
queries from input-output examples is known as Query Reverse Engineering [29]. 


© The Author(s) 2024 
(a) The Grades table. 

    10    36933    A 
    … 
    10    37362    A 
    12    37362    C 
    11    37453    A 
    10    37510    B 
    12    37510    A 
    10    37955    A 

(b) The Courses table. 

    CourseID    CourseName 
    10          Programming 
    11          Algorithms 
    12          Databases 

(c) The output table. 

    CourseName     GradeCount 
    Programming    4 
    Algorithms     2 
    Databases      3 


Figure 1: Two input tables: Courses and Grades. Output table: number of grades 
per course. 


Figure 1 illustrates an input-output example with two input tables (Courses 
and Grades) and an output table. The output table corresponds to counting the 
number of grades in each course. In this example, the goal is to synthesize the 
following SQL query: 


    SELECT CourseName, count(*) AS 'GradeCount' 
    FROM Grades NATURAL JOIN Courses 
    GROUP BY CourseName 


Observe that, for a person with limited database training, it is often easier to 
define one or more examples than to learn how to write the desired SQL query. 
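
As a quick sanity check, the query above can be run against an in-memory SQLite database using Python's standard `sqlite3` module. This is a sketch under assumptions: the column names `StudentID` and `Grade` are hypothetical (only `CourseID` is implied by the natural join), and the last two Grades rows are made-up placeholders added so that the counts agree with the output table of Figure 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Courses (CourseID INTEGER, CourseName TEXT);
    CREATE TABLE Grades (CourseID INTEGER, StudentID INTEGER, Grade TEXT);
    """
)
conn.executemany(
    "INSERT INTO Courses VALUES (?, ?)",
    [(10, "Programming"), (11, "Algorithms"), (12, "Databases")],
)
conn.executemany(
    "INSERT INTO Grades VALUES (?, ?, ?)",
    [
        (10, 36933, "A"),
        (10, 37362, "A"),
        (12, 37362, "C"),
        (11, 37453, "A"),
        (10, 37510, "B"),
        (12, 37510, "A"),
        (10, 37955, "A"),
        # Hypothetical rows, not taken from Figure 1.
        (11, 37000, "B"),
        (12, 37100, "B"),
    ],
)

query = """
    SELECT CourseName, count(*) AS 'GradeCount'
    FROM Grades NATURAL JOIN Courses
    GROUP BY CourseName
"""
# Row order is not fixed without ORDER BY, so collect rows into a dict.
grade_counts = dict(conn.execute(query).fetchall())
print(grade_counts)
```

The natural join matches rows on the shared `CourseID` column, and the grouping collapses them to one row per course name.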

Even though query synthesis tools using examples [28,32,20,27] have seen a 
remarkable improvement in recent years, they still suffer from scalability prob- 
lems with respect to the size of the input tables and the complexity of the 
synthesized queries. Nowadays, multicore processors have become the predomi- 
nant architecture for common laptops and servers. However, none of the previous 
query synthesis tools take advantage of the parallelism available in these archi- 
tectures. In this work, we present CUBES, the first parallel synthesizer for SQL 
queries. CUBES is built on top of an open-source sequential query synthesizer [20], 
which we further improved by extending the language of queries supported by 
CUBES and by adding pruning techniques that can prevent incorrect programs 
from being enumerated. To take advantage of parallel architectures, we extend 
CUBES by using divide-and-conquer. In this approach, each process searches a 
smaller sub-problem until it either finds a solution or exhausts that subspace 
and chooses another sub-problem to solve. We present a novel approach to cre- 
ate sub-problems based on considering different subsets of the domain-specific 
language for each process. 

To evaluate our tool, we collected benchmarks from previous works [32,28,27,20]. 
Also, we created a new dataset by extending existing query synthesis problems 
using natural language [35] to use examples instead. In the end, we collected 
around 4000 instances that will be publicly available and can be used by other 
researchers when evaluating query synthesis tools. 

We perform an exhaustive comparison between CUBES and state-of-the-art 
SQL synthesizers based on examples [32,20,27]. Our evaluation shows that cur- 
rent SQL synthesizers can synthesize many SQL queries that satisfy the examples 
but do not match the user intent. We observe that all state-of-the-art SQL syn- 
thesizers return fewer than 50% of queries that match the user intent, i.e., even 
though they satisfy the example given by the user they do not match the query 
that the user had in mind. CUBES addresses this challenge by using parallelism 
to find multiple solutions and interact with the user to disambiguate the query 
that matches the user intent. To disambiguate the queries, we use fuzzing to pro- 
duce new examples that result in a different output for the possible synthesized 
queries. We select one of these examples and ask the user if the output is correct 
for these new input tables. If the user responds affirmatively, we can discard 
all queries that do not match this new output. Otherwise, if the user responds 
negatively, we can discard the queries that match the new output. We repeat 
this process until we are confident that we found the query the user intended. 

To summarize, this paper makes the following key contributions: 


— a divide-and-conquer procedure for SQL synthesis (section 2). 

— a new procedure that uses fuzzing to disambiguate a set of queries that 
satisfies the initial example (section 3). 

— a new large dataset for SQL synthesis using examples with around 4000 
instances (section 5). 

— a new open-source SQL synthesis tool called CUBES whose parallel version 
with 16 processes outperforms the sequential version by solving more in- 
stances and having a median speedup of around 15x on hard instances (sec- 
tion 5). 

— a first study that analyses the accuracy of queries returned by SQL synthesiz- 
ers showing that more than 55% of the queries do not match the user intent. 
Our disambiguation procedure improves the accuracy of CUBES to 60% and 
significantly outperforms other example-based synthesizers (section 5). 


2 SQL Synthesis 


In this work, we propose CUBES, a divide-and-conquer query synthesizer that 
builds upon the open-source SQL synthesizer SQUARES [20]. SQUARES is a 
sequential synthesizer based on enumeration that uses operations from the R 
programming language as its Domain Specific Language (DSL)⁴. R is more 
expressive than SQL and allows a more compact representation for database queries. 
Since SQUARES is modular and open-source, it is easy to modify and extend to 
a parallel setting. CUBES splits the synthesis problem into disjoint sub-problems 
to be solved in parallel by each of the available processes. Hence, each process 
focuses solely on a particular area of the search space. 


4 A detailed description of the DSL is available in the extended version of this paper [3]. 
Figure 2: CUBES’ architecture for divide-and-conquer. 


In our context, each sub-problem is represented by a cube: a sequence of 
operations from CUBES’ DSL such that the arguments for the operations are 
still to be determined. Consider the following cube as an example: [filter, 
natural_join], which represents the section of the search space composed by 
programs with two operations, where the first is a filter (equivalent to a WHERE 
in SQL) and the second is a natural_join. 

The overall architecture of CUBES is illustrated in Figure 2. The Cube Gen- 
erator component is responsible for generating cubes in increasing size (i.e., first 
the cubes with one operation, then with two operations, and so forth), building 
a FIFO queue. Observe that since each cube corresponds to a distinct sequence 
of operations, there is no intersection in the search space of the different cubes. 
Then, each process receives a specific cube and checks if it is possible to fill 
in the missing arguments (e.g., columns, tables, filter conditions) to satisfy the 
input-output examples. Whenever a process finds a solution, the translation layer 
transforms the R program into SQL. Otherwise, if a cube cannot be extended 
into a complete program that satisfies the user specification, the process gets a 
new cube from the Cube Generator queue. 
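
The enumeration of cubes in increasing size can be illustrated with a small Python sketch. It shows only a static enumeration order, not the dynamic generator described next; the function name `generate_cubes` and the three-operation DSL subset (in particular `summarise`) are assumptions made for the example.

```python
from collections import deque
from itertools import count, islice, product


def generate_cubes(operations):
    """Yield cubes (sequences of DSL operations) in increasing length."""
    for size in count(1):
        for cube in product(operations, repeat=size):
            yield cube


# Hypothetical subset of the DSL, for illustration only.
operations = ["filter", "natural_join", "summarise"]

# FIFO queue holding all cubes with one or two operations (3 + 9 = 12).
queue = deque(islice(generate_cubes(operations), 12))

first_cube = queue[0]
print(first_cube, len(queue))
```

Because every cube is a distinct sequence of operations, the regions of the search space handed to different processes never overlap; a worker would take its next cube with `queue.popleft()`.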


Dynamic Cube Generation. One approach for a cube generation heuristic is to 
define a static order of operations to be explored. Although a static heuristic can 
be effective on some specific domains, it is very unlikely that it generalizes to new 
instances. Therefore, CUBES uses a dynamic cube generator inspired by natural 
language techniques. Since candidate programs are constructed as a sequence of 
operations, a bigram prediction model can be used to decide the next operation 
to be chosen in a given sequence. Therefore, when choosing the next operation, 
the operation immediately preceding it is used to compute an expectation of 
which of the possible choices will lead to the desired program. 


Program scoring. The initial scores of the bigram can be improved during the 
search by using information from programs that do not satisfy the examples. For 
a given program p, we compute the score of the program p as the percentage 
of elements of the expected output (according to the provided example) that 
appear in the output of p. A score of 1 indicates that all the expected values 
occur in the output, and as such, filtering or restructuring might lead to a correct 
program. On the other hand, a value of 0 means that the candidate program is 
likely very far from a correct solution. 

For each evaluated program, the score, score(p), is used to update the bigram 
scores. A high score for a given program, p, means that CUBES will generate new 
cubes similar to the one that originated the program p. On the other hand, a 
low score means that CUBES will try to diversify the search in the future. 
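
A possible reading of the program score is sketched below. It assumes that "elements" are the distinct cell values of a table and that tables are given as lists of row tuples; the function name `score` follows the text, but the exact definition used by CUBES may differ, and the way the score feeds back into the bigram model is not shown.

```python
def score(expected_output, program_output):
    """Fraction of expected cell values that occur in the program output."""
    expected = {cell for row in expected_output for cell in row}
    produced = {cell for row in program_output for cell in row}
    if not expected:
        return 1.0
    return len(expected & produced) / len(expected)


expected = [("Programming", 4), ("Algorithms", 2), ("Databases", 3)]
# Hypothetical candidate that joins the tables but does not aggregate yet.
candidate = [("Programming", "A"), ("Algorithms", "A"), ("Databases", "C")]

candidate_score = score(expected, candidate)
print(candidate_score)
```

Here the candidate already produces the three course names but none of the counts, so it recovers half of the expected values.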


DSL Splitting. Besides the splitting of the search space using cubes, CUBES 
also splits the DSL operations among the processes. The motivation for this 
additional split is that some DSL operations have more possible argument com- 
pletions than others. For instance, there are many more ways to complete an 
inner_join operation than, for example, a filter operation. If the program to 
be synthesized does not require some of the complex operations, then we can 
solve this program more quickly with a smaller DSL. To ensure that CUBES can 
always find the correct program, at least one process always runs with the entire 
DSL while the other processes may contain only subsets of the DSL. 


3 Accuracy and Disambiguation 


An essential issue in program synthesis is knowing if the returned program cor- 
responds to the user intent. To determine the accuracy of the synthesis tools, we 
call the query that the user wishes to obtain the ground truth query. Observe that 
SQL synthesis tools that use input-output examples return a query that satisfies 
the user’s examples. However, these examples are an under-specification, and as 
such, the returned query might not satisfy the true user intent. 

CUBES may find multiple queries that satisfy the examples. However, unless 
these queries are equivalent, only one of them matches the user’s intent. To 
address this challenge, we create new examples with different input-output pairs 
for the synthesized queries and interact with the user to disambiguate the correct 
query. Next, we describe how to use fuzzing to create new examples and our di- 
sambiguation procedure to improve CUBES’s accuracy and meet the user intent. 


3.1 Fuzzing 


Given a set of synthesized queries, our goal is to determine which one matches 
the user intent. Since some of them may be equivalent, multiple queries may be 
correct. One approach is to use query equivalence tools to check the equivalence 
of these queries and only consider a representative query of each equivalence 
class. Although recent work in query equivalence tools [6,38,5] has advanced the 
state-of-the-art, these tools remain incomplete, not supporting many complex 
queries present in our datasets. To overcome this limitation, we use a fuzzing- 
based approach to determine the approximate equivalency of different queries. 
Consider a synthesis problem with an input-output example (I, O) and let Q1 
and Q2 be two queries that satisfy this example. Fuzzing consists of taking the 
input I, slightly modifying it, and producing I’. Next, we apply both Q1 and 
Q2 to I’, producing the outputs O’1 and O’2, respectively. If the outputs differ 
(O’1 ≠ O’2), then Q1 and Q2 are surely distinct. However, if the outputs are 
equal (O’1 = O’2), we cannot conclude that the queries are equivalent. Hence, we 
perform several rounds of fuzzing, generating and testing different inputs, with 
each round increasing the confidence in our answer. 

In order to produce fuzzed input-output examples, we use the Semantic 
Evaluation suite [37]. Consider a table, T ∈ I. In order to generate a fuzzed version 
of this table, T’ ∈ I’, the suite starts by randomly selecting the number of rows 
of the new table. Then, to fill the cells of T’, three sources are used: (1) values 
sampled from a uniform distribution for the given type (e.g., for integers a 
uniform distribution on [−2^63, 2^63 − 1]), (2) values taken from the corresponding 
columns on the original table, T, and closely related values (i.e., if “Alice” is in 
T then both “Alice” and “Alicegg” might be considered for T’), and (3) values 
taken from the queries we are comparing, and closely related values. The reason 
why the suite takes into account values from the queries themselves is to increase 
code coverage (e.g., making it more likely to find off-by-one errors). Finally, all 
foreign keys are respected so that the semantics of the database are preserved. 


3.2 Disambiguation 


CUBES is able to return multiple queries that satisfy the user specification. How- 
ever, if the example provided is an under-specification of the true user intent, 
those queries will most likely have slightly different semantics. In order to ease 
the burden on the user of selecting a correct query, we propose a disambiguation 
algorithm, shown in Algorithm 1. 

CUBES starts by synthesizing all possible solutions under a given time limit. 
The goal of the disambiguation is then to ask the user questions in order to 
iteratively discard queries until we find one that satisfies the user intent. Our 
procedure attempts to minimize the number of questions as much as possible, by 
trying to discard approximately half of the queries each time we ask a question. 

To do this, we start by generating a new input database I’ through fuzzing. 
Next, we execute each of the synthesized queries on this new input I’ and group 
them according to the output they produce. In each disambiguation step, we 
generate 16 new input databases by performing fuzzing 16 times, and select 
the input-output example that is closest to splitting the set of queries in half. 

Figure 3 shows a real-world disambiguation interaction. Initially, we have 7 
queries found by CUBES that satisfy the original input-output example. In this 
case, we generate a new input I’ such that 1 of the 7 queries provides the output 
table A’, 3 queries provide as output table B’, and 3 others provide an output 
C’. Then, we ask the user if the new input-output example (I’, B’) is correct. If 
the user answers yes, then the solution is one of the 3 queries. Otherwise, the 
solution should be one of the 4 remaining queries. Since the user answered yes, 
3 queries remain to disambiguate. The disambiguation procedure terminates 
when either there is only one query remaining or the fuzzing procedure is unable 
to find a new example to distinguish the remaining queries. In the latter case, 
the remaining queries are deemed equivalent and the first one found by CUBES 
during the search is returned to the user. Notice that CUBES enumerates queries 
in increasing order of the number of operators. Hence, the first queries to be 
found by CUBES have the fewest operations and should be more general. 


Algorithm 1: Disambiguation method 

Input: S, the set of synthesized queries; I, input database; 
       O, output table; R, number of fuzzing rounds 
Result: a query considered to be the most likely solution 

Disambiguate(S, I, O, R) 
    bestSplit ← ∅ 
    for i ← 1 to R do 
        I’ ← Fuzz(I, S) 
        split ← GroupByOutput(S, I’) 
        if BetterSplit(bestSplit, split) then 
            bestSplit ← split 
    end 
    if bestSplit = ∅ then 
        return First(S) 
    (I’, S_A, O’_A, S_B) ← bestSplit 
    if AskUserIfExampleIsCorrect(I’, O’_A) then 
        return Disambiguate(S_A, I, O, R) 
    else 
        return Disambiguate(S_B, I, O, R) 


Figure 3: Example disambiguation process from a problem that generated 7 
possible queries. Blue boxes represent the input-output example given to the user. 
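
The disambiguation loop of Algorithm 1 can be approximated in Python as follows. This is a sketch, not the CUBES implementation: queries are modelled as plain Python functions returning hashable outputs, `fuzz` and `ask_user` are hypothetical callbacks standing in for the fuzzer and the user interaction, and the best split is taken to be the output group whose size is closest to half of the remaining queries.

```python
from collections import defaultdict
from itertools import cycle


def group_by_output(queries, database):
    """Group queries by the (hashable) output they produce on `database`."""
    groups = defaultdict(list)
    for query in queries:
        groups[query(database)].append(query)
    return groups


def disambiguate(queries, fuzz, ask_user, rounds=16):
    """Return the query deemed most likely, assuming `queries` is non-empty."""
    best = None  # (imbalance, database, output, matching queries)
    for _ in range(rounds):
        database = fuzz()
        groups = group_by_output(queries, database)
        if len(groups) < 2:
            continue  # this input does not distinguish any queries
        for output, matching in groups.items():
            imbalance = abs(len(queries) - 2 * len(matching))
            if best is None or imbalance < best[0]:
                best = (imbalance, database, output, matching)
    if best is None:
        return queries[0]  # remaining queries are deemed equivalent
    _, database, output, matching = best
    if ask_user(database, output):
        remaining = matching
    else:
        remaining = [q for q in queries if q not in matching]
    return disambiguate(remaining, fuzz, ask_user, rounds)


# Toy setting: a "database" is a tuple of integers, queries are filters.
def q1(db): return tuple(x for x in db if x > 1)
def q2(db): return tuple(x for x in db if x >= 1)
def q3(db): return tuple(x for x in db if x > 0)

# Deterministic stand-in for the fuzzer, cycling over fixed inputs.
inputs = cycle([(0, 2), (1, 5), (-1, 3)])
chosen = disambiguate(
    [q1, q2, q3],
    fuzz=lambda: next(inputs),
    ask_user=lambda db, out: q2(db) == out,  # the user intends q2
)
print(chosen.__name__)
```

In the toy run, `q2` and `q3` agree on every integer input, so after one question they are deemed equivalent and the first of them is returned, mirroring the First(S) case of Algorithm 1.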


4 Methods and Data 


This section describes the benchmark sets used to evaluate CUBES and com- 
pare it to other synthesizers, as well as two distinct methods to perform that 
comparison: simple evaluation and fuzzy-based evaluation. 


Data. We use five different benchmark sets, divided into two groups. The first 
group, consisting of the benchmarks recent-posts, top-rated-posts, textbook, 
and kaggle, comprises benchmarks that were previously used in other 
example-based SQL synthesis papers [32,36,20,27]. The second group consists of a 
single benchmark set: spider. We adapted the instances in spider from a very 
large and diverse dataset of queries used for SQL synthesis from Natural 
Language (NL) descriptions (also known as text-to-SQL) [35]. Overall, we used 176 
instances from previously established benchmark sets, and created 3690 new 
instances. 


Algorithm 2: Query checker using fuzzing 

Input: q, the synthesized query; Q, the ground truth query; 
       I, input database; R, number of fuzzing rounds 
Result: a Boolean representing if a distinguishing input was not found 

FuzzyCheck(q, Q, I, R) 
1   if Execute(Q, I) ≠ Execute(q, I) then 
2       return False 
3   for i ← 1 to R do 
4       I’ ← Fuzz(I, Q) 
5       if Execute(Q, I’) ≠ Execute(q, I’) then 
6           return False 
    end 
7   return True 


Simple Evaluation. In this setting, we are simply interested in checking if a 
synthesizer can produce a query that satisfies the specification given by the 
user. That is, when executed, the query should produce an output table that is 
equal to the one specified by the user. Furthermore, we do not take into account 
the row order of the output table. This method has been extensively used in the 
past to measure the performance of SQL synthesizers [32,36,20,27]. The problem 
with simple evaluation is that, in the case of an ambiguous example, it does not 
address whether the synthesized query actually satisfies the user intent or not. 


Fuzzy-based Evaluation. In this setting, we check if the synthesized queries satisfy 
the true intent of the user and not just the input-output example. The motive for 
this distinction is that the input-output example might be an under-specification 
of the query the user wishes to obtain. That is, several queries can satisfy the 
example, but they do not have the same semantics. 

Algorithm 2 shows how we use fuzzing, as introduced in subsection 3.1, to 
determine if two queries are likely to have the same semantics. We start by sanity 
checking if the synthesized query, q, and the ground truth query, Q, produce the 
same output for the provided input database, I (lines 1-2). Then, we perform 
R rounds of fuzzing (line 3), where for each round, we generate a new input 
database, I’, and check if the two queries still produce the same output table 
(lines 5-6). If all rounds pass successfully, we consider the queries equivalent 
(line 7). When comparing two tables, we perform a very lax comparison that: 
(1) ignores row order (tables are seen as a multiset of rows), (2) ignores column 
names, and (3) tries to convert the datatypes of columns (if two columns contain 
the same data but one as a number and the other as a string, they are considered 
equivalent). Note that several rounds might be needed to find an input that 
distinguishes the queries. The parameter R controls the maximum number of 
fuzzing rounds until the algorithm deems the queries equivalent. 
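
The checker of Algorithm 2, together with the lax table comparison, might look as follows in Python. This is a sketch under assumptions: tables are lists of row tuples, queries are Python callables, `fuzz` is a hypothetical callback replacing the fuzzing framework, and datatype conversion is approximated by mapping every value that parses as a number to a float.

```python
from collections import Counter


def normalize(value):
    """Map numbers and numeric strings to the same representation."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return str(value)


def same_table(left, right):
    """Lax comparison: rows form a multiset, column names are not part of
    the data, and cell datatypes are converted before comparing."""
    def as_multiset(rows):
        return Counter(tuple(normalize(cell) for cell in row) for row in rows)

    return as_multiset(left) == as_multiset(right)


def fuzzy_check(q, ground_truth, database, fuzz, rounds):
    """Return True if no distinguishing input was found."""
    if not same_table(ground_truth(database), q(database)):
        return False
    for _ in range(rounds):
        fuzzed = fuzz(database)
        if not same_table(ground_truth(fuzzed), q(fuzzed)):
            return False
    return True


# Toy queries over a single-column table given as a list of integers.
def synthesized(db): return [(x,) for x in db if x > 1]
def intended(db): return [(x,) for x in db if x >= 1]

original = [0, 2, 3]
# Deterministic stand-in for the fuzzer: always returns an input
# containing the boundary value 1.
agrees_without_fuzzing = fuzzy_check(synthesized, intended, original, lambda db: [1, 2], 0)
agrees_with_fuzzing = fuzzy_check(synthesized, intended, original, lambda db: [1, 2], 4)
print(agrees_without_fuzzing, agrees_with_fuzzing)
```

The two toy queries agree on the original input and are only told apart once a fuzzed input contains the boundary value, which is the off-by-one situation that including query constants in the fuzzer is meant to expose.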


5 Evaluation 


The evaluation presented next aims to answer the following research questions: 


Q1. How does the sequential version of CUBES, CUBES-SEQ, compare with other 
state-of-the-art SQL synthesizers when using the simple evaluation metric? 
(subsection 5.2) 

Q2. What are the speedups obtained by using the divide-and-conquer approach, 
CUBES-DC, when using the simple evaluation metric? (subsection 5.3) 

Q3. How do CUBES and the other SQL synthesizers perform when using the 
fuzzy-based evaluation metric? (subsection 5.4) 

Q4. What is the impact of program disambiguation in CUBES’ fuzzy-based eval- 
uation metric? (subsection 5.4) 


All results were obtained on a dual socket Intel® Xeon® Silver 4210R @ 
2.40GHz, with a total of 20 cores and 64GB of RAM. Furthermore, a limit of 
10 minutes (wall-clock time) and 56GB of RAM was imposed on all synthesizers 
(sequential or parallel). All limits were strictly imposed using runsolver [22]. 


5.1 Implementation 


CUBES is implemented on top of the Trinity [15] framework, using Python 3.8.3. 
Candidate programs are evaluated by translating the DSL operations into equivalent 
R instructions. In particular, the tidyverse⁵ family of packages is used to 
implement table manipulations. Once a correct R program is found, the dbplyr⁶ 
package (version 1.4.4) is used to translate that program to an equivalent SQL 
query. In the parallel synthesizer, inter-process communication is achieved using 
a message-passing approach through Python’s multiprocessing pipes. All 
source code, instance files, and execution logs are made publicly available.⁷ 

We use the fuzzing framework developed by Zhong et al. [37] in our disambiguation 
module to perform accuracy analysis. Furthermore, queries are executed 
using the SQLAlchemy⁸ library (version 1.3.20), and row order is ignored 
when comparing tables. The original implementation of the fuzzing framework is 
non-deterministic, so we modified it in two important ways: (1) we added proper 
seeding for Python’s pseudo-random number generator, and (2) we replaced all 


5 https://www.tidyverse.org/ 
6 https://dbplyr.tidyverse.org/ 
7 https://doi.org/10.5281/zenodo.10492998 
8 https://www.sqlalchemy.org/ 
Figure 4: Percentage of instances solved by each tool at each point in time. A 
mark is placed every 150 solved instances. 


usages of the set data structure with OrderedSet (sets backed with a list so that 
the iteration order is deterministic). This change was needed so that both the 
accuracy results presented in the paper and CUBES’ disambiguation process are 
deterministic. The modified framework is also included in CUBES’ source files. 
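
The two changes can be illustrated with a small sketch. It is not taken from the modified framework: the function names are hypothetical, and an insertion-ordered `dict` stands in for the OrderedSet mentioned above.

```python
import random


def ordered_unique(items):
    """Deduplicate while keeping first-seen order (dicts preserve insertion order)."""
    return list(dict.fromkeys(items))


def sample_cell_values(candidates, k, seed):
    """Draw k values reproducibly: fixed seed plus a pool with a fixed order."""
    rng = random.Random(seed)          # private, explicitly seeded generator
    pool = ordered_unique(candidates)  # iteration order no longer depends on hashing
    return [rng.choice(pool) for _ in range(k)]


values = ["Lisbon", "Porto", "Lisbon", "Faro"]
first = sample_cell_values(values, 5, seed=42)
second = sample_cell_values(values, 5, seed=42)  # identical to `first`
```

With a plain `set` of strings, the iteration order can change between interpreter runs because of hash randomization, so the same seed could still select different values.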


5.2 Sequential Performance using Simple Evaluation 


We start by evaluating the performance of CUBES-SEQ, the sequential version of 
CUBES, and perform a comparison with other state-of-the-art SQL Programming 
by Example (PBE) tools: SQUARES [20], SCYTHE [32] and PATSQL [27]. Figure 4 
shows the percentage of instances solved by each synthesizer as a function of time 
when using the simple evaluation method. Overall, SQUARES was able to solve 
30.6% of the instances within the time limit of 10 minutes, while SCYTHE solved 
49.5% and PATSQL solved 75.1%. CUBES-SEQ was able to solve 79.4%. 

Figure 4 also shows the Virtual Best Solver (VBS) for these four synthesizers. 
The VBS can be seen as the result of running the four synthesizers in parallel, 
or, equivalently, having an oracle that predicts which synthesizer is the best for a 
given instance and using it. The VBS is able to solve more instances than any of 
the other synthesizers (92.7% vs. the 79.4% for CUBES). This shows two things: 
(1) not all synthesizers solve the same instances, and (2) it is advantageous to run 
multiple synthesizers in parallel if the user has the resources for it. Furthermore, 
if we consider a VBS with only the top-performing synthesizers, PATSQL and 
CUBES, the percentage of solved instances is 90.5% (vs. 92.7% with the four 
synthesizers), meaning that running these two synthesizers in parallel solves more 
than 10 percentage points more instances than running CUBES alone (90.5% vs. 79.4%). 
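
As a rough illustration of how a VBS is derived from per-tool results, consider the sketch below. The data layout (a dictionary of solving times per tool, with `None` for unsolved instances) and the toy numbers are assumptions made for the example, not the format of the published logs.

```python
def virtual_best(times_by_tool):
    """Map each instance to the best (smallest) time over all tools, or None."""
    instances = set().union(*(t.keys() for t in times_by_tool.values()))
    vbs = {}
    for inst in instances:
        solved = [t[inst] for t in times_by_tool.values()
                  if t.get(inst) is not None]
        vbs[inst] = min(solved) if solved else None
    return vbs


def solved_fraction(times, n_instances, limit=600.0):
    """Fraction of instances solved within the time limit (in seconds)."""
    solved = sum(1 for t in times.values() if t is not None and t <= limit)
    return solved / n_instances


# Hypothetical results for two tools on three instances.
results = {
    "tool_a": {"i1": 1.0, "i2": None, "i3": 5.0},
    "tool_b": {"i1": 2.0, "i2": 3.0, "i3": None},
}
vbs = virtual_best(results)
```

In this toy example each tool solves two of the three instances, while the VBS solves all three, which mirrors the observation that the synthesizers do not all solve the same instances.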

One interesting difference between these synthesizers is the minimum time 
in which they can return a solution for any of the instances, with SCYTHE and 
PATSQL at around 0.3 seconds, while SQUARES and CUBES only solve the first 
instance at 2 to 3 seconds. The most likely explanation for this difference is the 
Table 1: Overall results for 10 seconds and 10 minutes grouped by benchmark. 
The best tool for each time-limit/benchmark pair is highlighted in bold. 

Tool          [?]      [?]      [?]      spider   [?]      All      Median Speedup

10 seconds
SQUARES       21.2%    3.9%     5.3%     24.7%    28.6%    24.1%
SCYTHE        0.0%     49.0%    66.7%    22.5%    28.6%    23.4%
PATSQL        57.6%    41.2%    64.9%    72.5%    62.9%    71.7%
CUBES-SEQ     15.2%    11.8%    33.3%    51.5%    34.3%    50.3%
CUBES-DC4     24.2%    11.8%    59.6%    70.0%    48.6%    68.5%
CUBES-DC8     27.3%    15.7%    63.2%    73.2%    54.3%    71.8%
CUBES-DC16    24.2%    19.6%    63.2%    75.4%    51.4%    73.8%

10 minutes
SQUARES       21.2%    7.8%     22.8%    31.0%    40.0%    30.6%
SCYTHE        3.0%     66.7%    80.7%    49.1%    54.3%    49.5%
PATSQL        63.6%    45.1%    66.7%    75.8%    68.6%    75.1%
CUBES-SEQ     39.4%    25.5%    66.7%    80.9%    57.1%    79.4%    (1×)
CUBES-DC4     45.5%    31.4%    73.7%    88.4%    71.4%    86.9%    8.4×
CUBES-DC8     54.5%    39.2%    73.7%    89.6%    68.6%    88.2%    12.8×
CUBES-DC16    51.5%    39.2%    75.4%    90.4%    77.1%    89.0%    15.5×



startup time for the programming languages used by the synthesizers. PATSQL 
and SCYTHE both use Java, while SQUARES and CUBES use Python and also 
need to initialize the R execution environment. Figure 4 also shows that both 
SCYTHE and CUBES-SEQ are able to solve more problem instances when we 
increase the time limit, while PATSQL and SQUARES seem to reach a plateau. 

Table 1 shows the results for each benchmark set with virtual time limits of 10 
seconds (top half) and 10 minutes (bottom half). With the 10-minute limit, CUBES-SEQ 
solves more instances than SQUARES in all benchmark sets and more instances 
than SCYTHE in 3 out of 5 benchmark sets. The comparison with PATSQL 
confirms the results shown in Figure 4: although PATSQL solves more instances 
with a shorter time limit, CUBES-SEQ solves more instances in one 
benchmark set (spider) with a larger time limit. 


5.3 Parallel Performance using Simple Evaluation 


Considering the sequential version CUBES-SEQ as our baseline, we now evaluate 
the performance of the parallel version using divide-and-conquer (CUBES-DC). 

Table 1 shows the results for the divide-and-conquer strategy CUBES-DC 
with 4, 8, and 16 processes. Notice that divide-and-conquer tools improve upon 
the sequential version, from 79.4% up to 89.0% when using 16 processes. More- 
over, within a limit of 10 seconds, the parallel versions are able to solve 68.5%, 
Figure 5: Instance speedup distribution for CUBES-DC16. 


71.8%, and 73.8% of the instances when using, respectively, 4, 8, and 16 pro- 
cesses. This contrasts with the sequential version that only solves 50.3% of the in- 
stances. Hence, there is a significant speedup when using the divide-and-conquer 
strategy, especially for shorter time limits. Observe that even within the time 
limit of 10 seconds, CUBES-DC with 8 or 16 processes is the best-performing solver. 

Formally, the speedup of method A in relation to method B is defined as the 
time needed to execute method B divided by the time needed to execute method 
A, and is a measure of how fast an implementation is compared to another. The 
last column of Table 1 shows the speedup obtained by each parallel version 
of CUBES in relation to the sequential version CUBES-SEQ for instances where 
CUBES-SEQ needed 1 minute (or more) to solve. We focus this analysis on the 
harder instances for the sequential tool since higher speedups in these instances 
have a higher impact on the end user’s experience. 
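
The median speedup over the harder instances can be computed as in the following sketch. The dictionaries of solving times are made-up values, and the sketch assumes that every instance considered was solved by both versions.

```python
from statistics import median


def median_speedup(seq_times, par_times, min_seq_time=60.0):
    """Median of seq/par time ratios over instances that took the
    sequential tool at least `min_seq_time` seconds."""
    speedups = [
        seq_times[inst] / par_times[inst]
        for inst in seq_times
        if seq_times[inst] >= min_seq_time and inst in par_times
    ]
    return median(speedups) if speedups else None


# Hypothetical solving times in seconds.
seq = {"a": 120.0, "b": 300.0, "c": 10.0, "d": 90.0}
par = {"a": 10.0, "b": 15.0, "c": 5.0, "d": 30.0}
result = median_speedup(seq, par)  # instance "c" is excluded (under 60 s)
```

Here the per-instance speedups are 12, 20, and 3, so the median is 12.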

We can see that most configurations have a median speedup greater than 
the number of processes used. This is called a super-linear speedup and occurs 
because programs are enumerated in a different order when using our parallel 
versions. Figure 5 shows the full speedup distribution for CUBES-DC16 along 
with the distribution quartiles. We can see that more than 50% of instances 
have a speedup greater than 10 when using 16 processes, while more than 25% 
of instances have a speedup greater than 30. 


5.4 Results using Fuzzing-based Evaluation 


In this section, we analyze the number of instances solved by CUBES when using 
the more thorough fuzzy-based evaluation and compare it with other 
program synthesis tools. Furthermore, we also evaluate the program disambiguator 
introduced in Section 3. 

Figure 6 shows the results when using the fuzzy-based evaluation method 
instead of the simple evaluation. For this evaluation, we used 16 fuzzing rounds 
(R = 16). The “FuzzyCheck Timeout” label in the plot represents instances for 
which the fuzzing evaluation timed out and not a timeout of the synthesizer 
Figure 6: Results of the fuzzy-based evaluation for each synthesizer. 
Figure 7: Fuzzy-based evaluation results before and after disambiguation. 


used. We used a time limit of 60 seconds per fuzzing round (16 × 60 s = 960 s). 
Furthermore, some of the synthesized queries failed to execute (labelled as “Ex- 
ecution Error”). This happens for two reasons: (1) some synthesized queries are 
incompatible with the SQLite dialect, and (2) some of the synthesized queries 
contain syntax problems. 


We label instances for which we could not find a distinguishing input from 
the ground truth as “Possibly Correct”, while instances for which we did find 
such input are labelled as “Incorrect by Fuzzing”. Furthermore, for synthesizers 
that return multiple solutions, “Possibly Correct Top 5” means that there was 
a query in the top-5 returned queries for which we did not find a distinguishing 
input from the ground truth. Similarly, “Possibly Correct Any” means that the 


Towards Reliable SQL Synthesis: Fuzzing Evaluation and Disambiguation 245 


Table 2: Comparison of the fuzzy-based evaluation with the simple evaluation. 

                             SCYTHE   SQUARES   PATSQL   CUBES-SEQ         CUBES-DC16
                                                         (All Solutions)   (All Solutions)
Solved (simple eval.)        49.5%    30.6%     75.1%    79.5%             90.2%
Possibly Correct ᵃ           21.6%    9.2%      37.1%    58.0%             63.3%
  as % of Solved instances   43.6%    30.0%     49.4%    73.0%             70.2%
Incorrect by Fuzzing         11.6%    8.4%      32.3%    10.7%             14.1%
  as % of Solved instances   23.4%    27.5%     43.0%    13.5%             15.6%
Inconclusive                 16.2%    13.1%     5.7%     8.9%              10.2%
  as % of Solved instances   32.7%    42.8%     7.6%     11.2%             11.3%

ᵃ Includes instances in Possibly Correct Top 5 and Possibly Correct Any. 


synthesizer returned a query for which we could not distinguish it from the 
ground truth. 

Previous tools all suffer from fairly low accuracy rates, staying under 45%, as 
do CUBES-SEQ and CUBES-DC16 if we only consider the first solution returned. 
However, if we consider all solutions returned under 10 minutes, then CUBES 
generates a correct (using fuzzy-based evaluation) solution on around 63% of 
the instances, as shown in Table 2. 
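
The "as % of Solved instances" rows of Table 2 relate each category to the instances a tool solved under the simple evaluation. The short sketch below reproduces that arithmetic for a few values from the table; the helper name is an assumption introduced for the example.

```python
def share_of_solved(category_pct, solved_pct):
    """Express a category (in % of all instances) as a % of solved instances."""
    return round(100.0 * category_pct / solved_pct, 1)


# Values taken from Table 2 (Possibly Correct vs. Solved).
scythe = share_of_solved(21.6, 49.5)      # SCYTHE
cubes_seq = share_of_solved(58.0, 79.5)   # CUBES-SEQ, all solutions
cubes_dc16 = share_of_solved(63.3, 90.2)  # CUBES-DC16, all solutions
```

This normalization makes tools with very different numbers of solved instances comparable: it measures how often a returned answer survives fuzzing, rather than how often an answer is returned at all.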

In order to be able to give that correct solution to the user, as opposed 
to giving them all the solutions generated, we developed a query disambigua- 
tor. Figure 7 shows the results of using that disambiguator on CUBES-SEQ and 
CUBES-DC16. We can see that the disambiguator can almost always identify 
the correct query if such a query exists in the set of queries synthesized. Note 
that small differences in the exact number of queries deemed correct using the 
fuzzy-based evaluation may be due to different fuzzed inputs being generated. 

It is also worth noting that a very small number of instances are labeled as 
“Possibly Correct Top 5”. As explained in Section 3, CUBES returns the earliest 
synthesized query when we reach a set of queries that we cannot distinguish from 
one another. This means that, for those instances, a correct query was in the 
final set of queries selected by the disambiguation, but it was not the first one 
generated by CUBES. This happens because while the accuracy test has access to 
the ground truth and can thus generate better-fuzzed inputs, the disambiguator 
is limited to using values from the queries it is trying to disambiguate. Even so, 
the fact that this only occurs in a very small number of queries indicates that 
the approach is valid and seems to be able to both correctly disambiguate most 
queries and catch the cases where the disambiguation fails. 

We show that if we only consider the first solution, CUBES’ performance 
is similar to other existing tools. The main improvement comes from (1) syn- 
thesizing many possible queries for a given problem and (2) having a program 
disambiguator to choose the right query. This first point is directly influenced by 
our parallel approach to program synthesis, which allows us to synthesize more 
programs that satisfy the examples under the chosen time limit. 
Figure 8: Number of questions that need to be asked to the user in order to 
perform disambiguation, as a function of the number of queries synthesized. 


Finally, we analyze how many questions are asked to the user to disambiguate 
the queries produced by CUBES. Figure 8 shows this data as a function of the 
number of queries synthesized. Consider the first bar of the second group, relating 
to instances where CUBES-SEQ generated 11 to 100 queries. The plot shows that 
to disambiguate those queries, we need at least 1 question, at most 11 questions, 
and on average 3 questions. 

For CUBES-SEQ the average number of questions needed to disambiguate up 
to 1000 queries is 2.31, while for CUBES-DC16 it is 2.69. As stated in Section 3, 
our goal with the disambiguation strategy is to discard half the queries with each 
question asked. Thus, we would expect that the number of questions needed to 
disambiguate a given set of queries scales logarithmically with the size of that 
set. Figure 8 shows that this behavior is, in fact, observed in practice. 
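
The halving idea behind this logarithmic behavior can be sketched as follows. This is a simplified illustration and not the disambiguator of CUBES: candidate queries are plain Python functions with hashable outputs, the pool of inputs is given up front rather than produced by fuzzing, and all names are hypothetical.

```python
def pick_question(candidates, inputs):
    """Choose an (input, output) pair produced by as close to half of the
    candidates as possible; return None if no input separates them."""
    best = None
    for inp in inputs:
        counts = {}
        for query in candidates:
            out = query(inp)
            counts[out] = counts.get(out, 0) + 1
        for out, n in counts.items():
            score = abs(n - len(candidates) / 2)
            if n < len(candidates) and (best is None or score < best[0]):
                best = (score, inp, out)
    return best


def disambiguate(candidates, inputs, answer_yes):
    """Ask yes/no questions until one candidate (or an inseparable set) remains."""
    candidates = list(candidates)
    questions = 0
    while len(candidates) > 1:
        question = pick_question(candidates, inputs)
        if question is None:       # remaining candidates cannot be told apart
            break
        _, inp, out = question
        questions += 1
        yes = answer_yes(inp, out)  # "Is `out` the right output for `inp`?"
        candidates = [c for c in candidates if (c(inp) == out) == yes]
    return candidates, questions


# Toy usage: eight threshold "queries"; the intended one uses threshold 5.
queries = [lambda x, k=k: x >= k for k in range(8)]
intended = lambda x: x >= 5
remaining, asked = disambiguate(
    queries, range(8), lambda inp, out: intended(inp) == out
)
```

In this toy setting every question splits the candidate set exactly in half, so the eight candidates are reduced to one after three questions.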


6 Discussion 


Here we discuss the main threats to validity of this work and some challenges 
that were raised during the experimental evaluation. 


Benchmarks. Our evaluation uses a large set of benchmarks from different do- 
mains. However, they may not be representative of tasks commonly performed 
by users or may have a bias towards a specific synthesis tool. To mitigate this, 
we included benchmarks from several previous synthesis tools and also extended 
a large dataset from query synthesis using NLP to use examples instead. In 
the end, we have around 4000 instances but they are dominated by the spider 
dataset [35]. Nevertheless, since this dataset has been extensively used in other 
domains and was not created by us, we believe that it is more general and less 
prone to bias. 
Parallelism. The divide-and-conquer approach already shows scalability for hard 
instances when using 4 and 8 processes in a multicore architecture with super- 
linear speedups. However, when increasing the number of processes to 16 the 
gains are reduced. When the number of processes increases, there is an increase 
of contention for memory accesses that can slow down the performance of each 
process. To address this issue, it would be interesting to evaluate CUBES in 
a distributed setting. Note that the overhead of going from multicore to dis- 
tributed should be small since the inter-process communication is already done 
using message-passing techniques, and no shared memory is used. Exchanging 
information between processes is another source of improvement that would be 
worth exploring in future work. 


Cube generation. One way to further improve the divide-and-conquer approach 
is to consider other cube generation strategies. For instance, we could learn from 
data and use machine learning techniques such as pre-trained bigram scores or 
neural networks to predict the most likely cubes. We could also explore 
other techniques similar to the ones used in SAT solvers, such as restarting the 
search after n programs/cubes have been attempted. 


Fuzzy-based Evaluation. Even though query synthesis tools are becoming more 
efficient and can find a query that satisfies the input-output example given by 
the user, they may not find the query that the user intended. To the best of our 
knowledge, this is the first study where fuzzing was used to evaluate if the query 
returned by the synthesizer matches the user’s intent. Even though fuzzing is not 
a precise measurement of correctness since it may return that some queries are 
equivalent when they may not be, it is an upper bound on the accuracy of these 
tools. With the continuous improvement of SQL equivalence tools [6,38,5], it 
may be possible to have an exact accuracy measurement in the future. However, 
even with the current results, we already observe that all synthesis tools return 
many answers that do not match the desired behavior. 


Disambiguation. Interacting with the user to perform query disambiguation is 
essential to increase the accuracy of SQL synthesizers based on examples. However, 
the questions that we ask the user may be too hard to answer, or the 
user may answer them incorrectly. To mitigate the difficulty of the questions, 
we only ask yes or no questions and present examples based on fuzzing that are 
often similar to the initial example provided by the user. With this approach, we 
hope that the user can quickly answer these questions. We currently automate 
the disambiguation procedure and use the ground truth to answer the questions, 
but a user study could be done in the future to confirm our hypothesis that 
the questions are easy for users to answer. In this work, we assume that the 
user never answers the questions incorrectly. However, considering this scenario 
could open new research directions and is in line with recent work on program 
synthesis with noisy data [11] where the examples may be incorrect. 
7 Related Work 


SQL Synthesis. In recent years, several tools for query synthesis have been pro- 
posed using input-output examples to specify user intent [28,36,7,32,15,20]. Solv- 
ing approaches vary from using decision trees with fixed templates [28,36] to 
abstract representations of queries that can potentially satisfy the input-output 
examples [32]. Another approach is to use SMT-based representations of the 
search space [7,19] such that each solution to the SMT formula represents a 
possible candidate query to be verified. The CUBES framework proposed in this 
paper is also based on SMT-based representations, but it extends prior work in 
several dimensions: (i) it extends the language of the programs to be synthesized, 
(ii) it proposes pruning techniques that can be directly encoded into SMT, and 
(iii) it is the first parallel tool for query synthesis. 

In this paper, we compare CUBES with three other SQL Synthesis tools 
that use input-output examples: SCYTHE [32], SQUARES [20] and PATSQL [27]. 
SCYTHE and PATSQL use sketch-based enumeration, where first a skeleton 
program with missing parts is generated, and then, if the skeleton satisfies a 
preliminary evaluation, the synthesizer tries to complete the sketch to obtain 
a complete program. SQUARES, on the other hand, uses Satisfiability Modulo 
Theories (SMT)-based enumeration where complete programs are obtained by 
iterating the possible solutions of an SMT formula. Both SCYTHE and SQUARES 
have limited DSLs and thus are not as well suited for complex tasks. Further- 
more, SCYTHE’s ability to solve a given instance is severely limited by the size of 
its input tables. Although PATSQL has a comparatively more expressive DSL, 
it is still not able to outperform CUBES. 

Another approach for specifying user intent is using natural language [33,30]. 
However, these approaches often need a large training data set from the query’s 
domain. Recently, several techniques have been proposed that try to better gen- 
eralize to cross-domain data [34,24]. Although many improvements have been 
attained in finding the structure of the query through effective semantic ta- 
ble parsing, defining the details (e.g., specific filter conditions) is usually hard, 
particularly in more complex queries. The use of natural language for query syn- 
thesis is complementary to our approach, and a combination of both strategies 
could improve the accuracy of program synthesizers at the cost of more input 
from the user, namely examples and a natural language description of the task. 


Program Disambiguation. Current synthesizers focus primarily on generating 
programs that satisfy the user’s specifications. However, in many situations, the 
produced program does not satisfy the true user intent [16,26]. Previous work 
has shown that this shortcoming can be solved without resorting to complete 
specifications by introducing a program disambiguator. This component is re- 
sponsible for interacting with the user and choosing between several possible 
solutions. Mayer et al. [16] describe two types of user interaction for program 
disambiguation: in the first approach, users select the correct program among a 
set of returned solutions, which are presented in a way that allows easy naviga- 
tion. The second approach is described as conversational clarification, where the 
system iteratively asks questions to the user, further refining the original speci- 
fication until just one candidate program is left [8,21,14,31,13,17]. In CUBES, we 
use conversational clarification to improve the confidence in produced solutions 
while still keeping the complexity for the user low. 


Parallel Solving. Solving logic formulas in parallel has been the subject of extensive 
research work [10,9,1,2], both using shared-memory [25] and distributed 
approaches [18]. One of the techniques used to explore the search space is called 
divide-and-conquer [12]. In this approach, the search space is split into disjoint 
areas such that there is no intersection between the areas explored by each pro- 
cess. In this case, work-stealing techniques [23] are commonly used to avoid 
starvation since the search space can be unevenly split among the processes. 
Although we adapt techniques from parallel automated reasoning, the paral- 
lelization in the CUBES framework is not done at solving logic formulas but at a 
more abstract level. In our case, logic formulas continue to be solved sequentially. 
Moreover, starvation is avoided by producing additional work, i.e., increasing the 
number of operations from the DSL in the programs to be enumerated. 
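
To make the notion of disjoint search areas concrete, the sketch below splits the space of operation sequences by fixing a prefix (a cube) and handing cubes to workers. The operation names, the prefix length, and the round-robin assignment are assumptions chosen for illustration; they are not the cube generation strategy of CUBES.

```python
from itertools import product

OPS = ("filter", "join", "summarise", "mutate")  # hypothetical DSL operations


def make_cubes(ops, prefix_len):
    """Each cube fixes the first `prefix_len` operations of a program."""
    return list(product(ops, repeat=prefix_len))


def assign_cubes(cubes, n_workers):
    """Round-robin assignment of cubes to workers."""
    return [cubes[w::n_workers] for w in range(n_workers)]


def programs_in_cube(cube, ops, length):
    """Enumerate all operation sequences of `length` that start with `cube`."""
    for rest in product(ops, repeat=length - len(cube)):
        yield cube + rest


cubes = make_cubes(OPS, prefix_len=2)
per_worker = assign_cubes(cubes, n_workers=4)
explored = [
    program
    for worker_cubes in per_worker
    for cube in worker_cubes
    for program in programs_in_cube(cube, OPS, length=3)
]
```

Since every program of length 3 has exactly one prefix of length 2, the workers explore disjoint areas that together cover the whole space. A worker that runs out of cubes can be given cubes for longer programs instead of idling.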


8 Conclusions 


This work introduces CUBES, a new enumeration-based framework for query 
synthesis from examples. A new robust tool is proposed that is able to synthesize 
an extensive range of SQL queries. Additionally, CUBES also takes advantage of 
the current multicore processor architectures, providing the first parallel query 
synthesizer from examples using a divide-and-conquer approach. The splitting 
of the program space is done by providing different sequences of operations to 
each thread, as well as performing DSL splitting among threads. 

An in-depth experimental evaluation is also carried out, comparing CUBES 
with other state-of-the-art query synthesizers in a wide variety of benchmark 
sets. Experimental results show the effectiveness and robustness of CUBES, be- 
ing able to successfully synthesize SQL queries for a larger range of problem 
instances than other tools. Moreover, the parallel versions of CUBES have super- 
linear speedups for many hard instances and, when using 16 processes, provide 
a median speedup of 15x over the sequential version of the tool. 

Finally, an accuracy analysis of the produced queries is also performed using 
fuzzing techniques. Results show that the queries produced by current synthesiz- 
ers often differ from the user intent, and more than 50% of the queries returned 
to the user do not match the expected behavior the user had in mind. To in- 
crease the trust and reliability of SQL synthesizers, we advocate the need to use 
a fuzzing-based evaluation that can more precisely measure the accuracy of SQL 
synthesizers. Using this methodology together with the large dataset that we 
collected will make it easier for other researchers to evaluate their SQL synthesis 
tools in the future. 

Since examples are imprecise specifications, increasing the trust and relia- 
bility of SQL synthesizers is essential. To improve the reliability of CUBES, we 
propose an interactive procedure with the user that can disambiguate among all 
queries found by CUBES that satisfy the original input-output example. After the 
disambiguation procedure, the accuracy of CUBES in providing the user intent 
query is significantly increased from around 40% to 60%. Other synthesizers can 
use similar disambiguation approaches, and it is also expected to improve their 
accuracy with respect to the user intent. 
Abstract. This paper describes a formal general-purpose automated 
program repair (APR) framework based on the concept of program invariants. 
In the presented repair framework, the execution traces of a defected 
program are dynamically analyzed to infer specifications $\varphi_{correct}$ 
and $\varphi_{violated}$, where $\varphi_{correct}$ represents the set of likely invariants (good 
patterns) required for a run to be successful and $\varphi_{violated}$ represents the 
set of likely suspicious invariants (bad patterns) that result in the bug in 
the defected program. These specifications are then refined using rigor- 
ous program analysis techniques, which are also used to drive the repair 
process towards feasible patches and assess the correctness of generated 
patches. We demonstrate the usefulness of leveraging invariants in APR 
by developing an invariant-based repair system for performance bugs. 
The initial analysis shows the effectiveness of invariant-based APR in 
handling performance bugs by producing patches that increase the program’s 
efficiency without adversely impacting its functionality. 


Keywords: Automated program repair · Invariant learning and refinement · 
Patch overfitting · Program verifier · CPAChecker · Performance bugs 


1 Introduction 


Automated program repair (APR) has recently gained great attention because it 
helps to significantly decrease manual debugging effort by automatically generat- 
ing patches for defected programs. Modern program repair tools have been shown 
to be effective at fixing bugs in many real-world programs. The poor quality of 
automatically generated patches [11], however, continues to be a major obstacle 
to the adoption of automated program repair by software practitioners. 

Problem: The primary reason for the low quality of automatically generated 
patches by current APR tools is the lack of specifications of the intended be- 
havior. Most program repair systems rely on tests as the correctness criteria, 
because a formal specification is not explicitly provided by software developers. 
Therefore, current APR approaches produce plausible patches which must be 
(manually) inspected before being deployed. As a result, there is no guarantee 
that the generated patches are generally correct and do not introduce new bugs. 
Solution: Program verification technology enables developers to prove the cor- 
rectness of the program before deploying it. One of the key activities underlying 
this technology involves inferring a program invariant—a logical formula that 
serves as an abstract specification of a program. Developers can significantly 
benefit from program invariants to identify program properties that must be 
preserved when modifying code. Unfortunately, these invariants are typically 
absent from code, leading to the dominance of less rigorous APR approaches 
(e.g., dynamic APR) and the well-known patch overfitting challenge [11]. 

We argue that by using test cases and reachability-based analysis techniques, 
an accurate set of invariants may be obtained and utilized to produce high- 
quality patches. In other words, program verification tools such as CPAChecker 
[3] and PathFinder [15] can be used to refine the dynamically generated invariant 
candidates. This can be done by first using the test cases to analyze the execution 
traces of the program to infer a set of invariant candidates. These candidates are 
then refined using a program verifier to obtain more accurate invariants. The 
goal is to infer two specifications: (i) $\varphi_{correct}$, which represents the set of good 
patterns required for a run to succeed, and (ii) $\varphi_{violated}$, which represents the 
set of bad patterns that lead to the target bug. Invariant-based APR offers two 
key benefits. First, it directs APR towards potentially feasible patches. Second, 
it enables the formal validation of plausible patches using program verifiers. 
Viability of invariant-based APR: Program invariants have shown effective- 
ness in many applications, such as program understanding, fault localization, 
and formal verification. Invariants are effective because functional correctness 
relates to the final result of a program rather than any specific implementation. 
They can therefore assist in abstracting many concrete execution steps and thus 
greatly reduce the effort needed to reason about the patch’s correctness. 

In fact, developers who aim to repair a defected undocumented program (a 
program written without thought for formal specifications) can find invariant- 
based APR helpful in their repair tasks. The availability of mature automated 
invariant detection tools like Daikon [4] and practical software verification tools 
like CPAChecker and PathFinder makes the invariant-based program repair tech- 
nique viable. At first glance, refining invariants using program verification tools 
seems too expensive. However, due to tremendous advances in software verification 
[2], in practice, invariant-based verification can be made quite efficient. In 
particular, the software analysis framework CPAChecker, which supports many 
different reachability analyses, has been effectively used to validate a wide vari- 
ety of reachability queries against C programs with up to 50K lines of code. This 
makes reachability analysis a promising technique that can be used to signifi- 
cantly reduce the patch overfitting problem and produce high-quality patches. 


2 Invariant-based Program Repair Framework 


In this section we reformulate the APR problem using the concept of program 
invariants. We then describe how one can analyze the execution traces of fault- 
free runs to infer likely specifications of the program’s intended behaviour and 
execution traces of faulty runs to infer likely suspicious invariants that lead to the 
faulty behaviour. Before proceeding further, let us introduce some definitions. 
Definition 1. (fault-free vs. faulty runs). Let P be a buggy program, R be 
the set of runs of P, and $\varphi_{beh}$ be a property of program P’s intended behavior. 
We say that a run $r \in R$ is a successful run (i.e., fault-free run) if $P(r) \models \varphi_{beh}$. 
On the other hand, we say that a run $r' \in R$ is a faulty run if $P(r') \not\models \varphi_{beh}$. 


From Definition 1 we note that by analyzing information extracted from fault- 
free runs, one might be able to infer a specification of the program’s intended 
behavior. Similarly, by analyzing the execution information of faulty runs, one 
might be able to deduce the violating invariants that cause the bug. This is be- 
cause fault-free runs represent runs in which program invariants are maintained, 
while faulty runs represent runs in which some program invariants are violated. 


Definition 2. (Invariant-based APR problem). Let P be a program containing
bug b and $T = (T_P \cup T_F)$ be a test suite, where $T_P$ represents the set
of passing tests and $T_F$ represents the set of failing tests. Let D be a dynamic
invariant inference tool like Daikon, and V be a program verification tool like 
CPAChecker. The invariant-based APR process consists of the following steps: 


1. [Invariant extraction]. Generate an initial set of invariants $\mathcal{I}$ for P using D.
2. [Invariant refinement]. Refine the set $\mathcal{I}$ using V to produce specifications
$\varphi_{correct}$ and $\varphi_{violated}$. This can be done by asserting invariants at a program
location of interest and using any generated counter-example to refine them.
3. [Fault localization]. Compute a list of suspicious statements whose mutation
may lead to a valid patch by analyzing specifications $\varphi_{correct}$ and $\varphi_{violated}$.
4. [Patch generation]. Construct code that corrects the invariants that are violated
while maintaining other program invariants. This can be performed by
employing a search-based or semantic-based patch generation procedure.
5. [Patch validation]. Validate the correctness of the generated patches using V.


Depending on the type of the bug being fixed and the structure of the analyzed
program, different program locations may be of relevance for properties
$\varphi_{correct}$ and $\varphi_{violated}$. Examples include pre- and post-conditions for different
functions, or loop invariants for some program loops. Note that the first two
steps of the invariant-based APR process described in Definition 2 are necessary
for increasing confidence in the precision of the generated patches. The
actual repair steps of the process, steps 3-5, can be formally stated as follows:


$$pt = FV(PGV(FL(\varphi_{correct}, \varphi_{violated}, P), T), \varphi_{correct}, \varphi_{violated}) \qquad (1)$$


where FL is an invariant-based fault localization process, PGV is a patch generation
and validation process using the test suite, and FV is a formal patch validation
process using the verification tool V. If no plausible patch is found, or a plausible
patch is found but is incorrect, the repair process returns fail. However, if
the plausible patch passes the verification step carried out by the tool V, the
process returns a patch. We now turn to discuss how one can generate specifications
$\varphi_{correct}$ and $\varphi_{violated}$ by analyzing the execution information obtained
by running program P using passing and failing tests. The analysis of fault-free
and faulty runs leads to the identification of the following formal patterns.

1. $\varphi_{correct} = \mathcal{I}_{good} = V(D(P, T_P))$, invariants deduced using only successful
runs. This set of invariants represents the likely intended behavior of P.

2. $\varphi_{faulty} = \mathcal{I}_{mix} = V(D(P, T_F))$, invariants deduced using the set of faulty
runs. Note that the set $\mathcal{I}_{mix}$ may contain both good and bad patterns depending
on how the target bug affects different functionalities of P.

3. $\varphi_{violated} = (\mathcal{I}_{mix} \setminus \mathcal{I}_{good})$, the set of violated invariants related to the bug.
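
The set difference in pattern 3 can be illustrated with a minimal C sketch. It assumes, purely for illustration, that invariants are available as text and that two invariants are the same only if their strings are identical; the function names are hypothetical and the comparison does not capture semantic equivalence or the refinement performed by V.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if invariant `inv` occurs, as an identical string, in `set`. */
int contains(const char *const *set, size_t n, const char *inv)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(set[i], inv) == 0)
            return 1;
    return 0;
}

/* Writes I_mix \ I_good into `out` (capacity >= n_mix) and returns
   the number of invariants written. Hypothetical helper. */
size_t violated_invariants(const char *const *mix, size_t n_mix,
                           const char *const *good, size_t n_good,
                           const char **out)
{
    size_t n_out = 0;
    for (size_t i = 0; i < n_mix; i++)
        if (!contains(good, n_good, mix[i]))
            out[n_out++] = mix[i];
    return n_out;
}
```

With good = {"x >= 0", "y == x + 1"} and mix = {"x >= 0", "y > 100"}, the only invariant left in the result is "y > 100".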


It is important to categorize and distinguish inferred patterns (invariants) 
into good and bad patterns, especially when dealing with programs that have 
several functional requirements. This helps to identify the set of desired invari- 
ants to be maintained and violated invariants to be repaired when modifying 
code. It also helps to identify the set of invariants that are relevant to the analyzed
bug. The soundness of the inferred $\varphi_{correct}$ and $\varphi_{violated}$ depends heavily on
the soundness of the employed invariant inference tool as well as the invariant
refinement process. Increasing the amount of program behavior exercised using
reachability analysis increases the likelihood that $\varphi_{correct}$ and $\varphi_{violated}$ are true.


Definition 3. (Patch validation in invariant-based APR). Let P be a 
program containing bug b and T be a test suite containing at least one failing 
test and one passing test. Let also pt be a plausible patch that makes P pass
all test cases in T. The validity of patch pt can be formally checked as follows


$$validity(pt) = V(pt, \varphi_{correct}) \wedge \neg V(pt, \varphi_{violated}) \qquad (2)$$


where $V(pt, \varphi_{correct}) \in \{true, false\}$ and the tool’s response depends on
whether the specification is fulfilled or violated in the program being examined.


To boost confidence in the validity of the resulting patch, we opt to check
patches against both $\varphi_{correct}$ and $\varphi_{violated}$. However, to lower the cost of calling
the verifier V against each candidate patch, we aim to implement a three-step
patch validation method that uses the test suite first and the program verifier
afterwards. In the first step, plausible patches are generated using test cases.
The second step formally checks plausible patches against the set of bad
patterns (property $\varphi_{violated}$). Patches that pass the first two steps are checked
against the set of good patterns (property $\varphi_{correct}$) in the third step.
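
The ordering of the three steps can be sketched in C as follows. This is only an illustration under stated assumptions: the callback type, the verdict names, and the two toy checkers are hypothetical, and a real implementation would call the test harness and the verifier V instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: answers one question about a candidate patch. */
typedef bool (*check_fn)(const void *patch);

typedef enum {
    REJECT_TESTS,     /* step 1: the patch fails the test suite          */
    REJECT_VIOLATED,  /* step 2: V(pt, phi_violated) still holds         */
    REJECT_CORRECT,   /* step 3: V(pt, phi_correct) does not hold        */
    ACCEPT
} verdict;

/* Cheap test-based filtering first, verifier calls afterwards. */
verdict validate_patch(const void *patch,
                       check_fn passes_tests,
                       check_fn holds_violated,
                       check_fn holds_correct)
{
    if (!passes_tests(patch))
        return REJECT_TESTS;
    if (holds_violated(patch))
        return REJECT_VIOLATED;
    if (!holds_correct(patch))
        return REJECT_CORRECT;
    return ACCEPT;
}

/* Toy stand-ins for demonstration only. */
bool always_true(const void *patch)  { (void)patch; return true; }
bool always_false(const void *patch) { (void)patch; return false; }
```

A patch is accepted only when it passes the tests, no longer satisfies the violated specification, and satisfies the correct one, which mirrors specification (2).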


3 Fixing Performance Bugs Using Invariant-based APR 


Performance bugs are programming errors that cause significant performance
degradation and lead to low system throughput. Experience has shown that many
widely used commercial software systems suffer from performance problems [13,
6, 10]. Therefore, there is a need to develop a rigorous repair framework for performance
bugs that ensures efficiency gain without compromising functionality.

One unique characteristic of performance bugs compared to functional bugs
is that performance bugs do not affect the functionality of the program (i.e., the
program is semantically correct but inefficient) and thus the intended behavior
of the program can be automatically deduced using an invariant inference tool.

This section describes an invariant-based APR system for performance bugs
and demonstrates how it may be applied to handle performance bugs by producing
patches that ensure efficiency improvement without sacrificing functionality.


3.1 Invariant-based Repair Framework for Performance Bugs 


In this section we describe an invariant-based repair framework for handling 
performance bugs. The framework consists mainly of the following components: 


1. a set of passing tests (tests that lead to fast runs), 

2. a set of failing tests (tests that lead to slow runs), 

3. a runtime monitor to keep track of the program’s execution time and differentiate
between fast and slow runs, and

4. an automated invariant inference tool (Daikon or CPAChecker) and an automated
invariant verification tool (PVS, Z3 solver, or CPAChecker).


We now turn to discuss how we define the notions of passing and failing tests 
and the process of generating and validating patches for performance bugs. 
Passing and failing tests for performance bugs: Performance bugs do not 
produce debugging information at runtime: they do not produce crashes, excep- 
tions, or incorrect results. We therefore use a runtime monitor with a predefined 
timer to redefine the concepts of passing and failing tests. We consider test cases 
that lead to fast runs as passing tests while test cases that lead to slow runs as 
failing tests. A repair that transforms slow runs into fast runs while preserving 
the desired behavior of the original program is considered as a valid repair. 
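
A minimal sketch of such a timing-based classification is given below. The names and the time budget are assumptions made for illustration; the sketch measures CPU time with the standard clock() function after the run has finished, whereas a practical monitor would likely also interrupt runs that exceed the timer.

```c
#include <assert.h>
#include <time.h>

typedef enum { TEST_PASS, TEST_FAIL } test_status;

/* A run within the budget counts as fast (passing), otherwise slow (failing). */
test_status classify_run(double elapsed_sec, double budget_sec)
{
    return elapsed_sec <= budget_sec ? TEST_PASS : TEST_FAIL;
}

/* Executes `test` once and classifies it by its measured CPU time. */
test_status monitor(void (*test)(void), double budget_sec)
{
    clock_t start = clock();
    test();
    double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
    return classify_run(elapsed, budget_sec);
}
```

Separating classify_run from monitor keeps the pass/fail decision deterministic and easy to check independently of the machine on which the test is timed.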
Patch generation strategy for performance bugs: Since we deal with a 
semantically correct but inefficient program, an efficient version of the program 
can often be created by restructuring the original program’s basic components. 
Our preliminary analysis demonstrates the effectiveness of genetic repair tools, 
such as GenProg, in dealing with performance bugs. This suggests that programs 
with performance bugs can be fixed by relatively simple changes. For instance, 
various performance bugs can be fixed by using mutation operators like move, 
swap, delete, and insert employed by genetic repair programs. Consequently, we 
aim to combine our repair framework with genetic-based patch generation tools. 
Patch validation for performance bugs: It should be noted that invariant 
inference tools can also be used to derive predicates related to the non-functional 
attributes of the program. This can be achieved by adding extra non-functional 
variables to the program being repaired. Suppose we have a program P with a
set of variables V and that P contains a performance bug. We need to check
whether the generated plausible patch for program P fixes the performance bug
without introducing a new functional bug. To do so, we first generate and validate
predicates related to the efficiency attributes of the program, as described below.


1. Add a fresh variable nfv whose value has no impact on the behavior of P. The 
type of performance bug that is being handled determines how nfv is used 
to model the efficiency of the program. However, for the loop programs we 
consider, nfv acts like a counter that is incremented once for each iteration. 
In other words, the number of loop iterations serves as a model for efficiency. 

2. Use the invariant detection tool D to infer the numerical invariants $\mathcal{I}(P, nfv)$
and $\mathcal{I}(pt, nfv)$ for the original and plausible patched version, where $\mathcal{I}(P, nfv)$
represents the collection of invariants in program P involving variable nfv.

3. Compare the numerical predicates in $\mathcal{I}(P, nfv)$ and $\mathcal{I}(pt, nfv)$ to determine
whether the patched version pt is more efficient than original program P.
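
Step 1 can be illustrated on a small loop program. The linear search below is a hypothetical example, not taken from the paper; it only shows how a fresh counter nfv can be added so that it records the number of iterations without influencing the result.

```c
#include <assert.h>
#include <stddef.h>

/* Fresh non-functional variable: it only counts loop iterations. */
static unsigned long nfv;

/* Linear search instrumented with nfv; the return value never depends on nfv. */
int find_first(const int *a, size_t n, int key)
{
    nfv = 0;
    for (size_t i = 0; i < n; i++) {
        nfv++;                    /* one increment per loop iteration */
        if (a[i] == key)
            return (int)i;
    }
    return -1;
}

/* Exposes the counter so that a tool or test can observe it. */
unsigned long last_nfv(void)
{
    return nfv;
}
```

Running an invariant detector over many executions of such an instrumented program could then yield bounds on nfv, which are the efficiency predicates compared in step 3.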


For simplicity, we assume that the analyzed program contains a single loop.
In general, the number of loops in the analyzed program determines how many
additional variables are needed. The invariant inference tool D is thus used to infer
invariants on $(V \cup \{nfv\})$. We then distinguish the following types of predicates:


- $\mathcal{I}(P, V)$: predicates related to the program’s functionality, and
- $\mathcal{I}(P, nfv)$: predicates related to the program’s efficiency.


Using the generated predicates, one can check the validity of patch pt as follows

$$validity(pt) = SEMAEQ(\mathcal{I}(P, V), \mathcal{I}(pt, V)) \wedge PREDSM(\mathcal{I}(pt, nfv), \mathcal{I}(P, nfv)) \qquad (3)$$


where SEMAEQ is a Boolean operation that checks whether the given sets of 
invariants are semantically equivalent and PREDSM is a Boolean operation that 
checks whether the upper bound in the predicate related to the patched version 
is smaller than the upper bound in the one related to the original program. 
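
As a rough illustration of specification (3), the C sketch below implements PREDSM as a comparison of two upper bounds and replaces SEMAEQ by a much weaker syntactic check (equal sets of invariant strings). Both the data representation and the function names are assumptions; a semantic equivalence check would require a prover.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Efficiency predicate of the assumed form nfv < bound. */
typedef struct { unsigned long bound; } nfv_pred;

/* PREDSM: the patched upper bound is smaller than the original one. */
bool predsm(nfv_pred patched, nfv_pred original)
{
    return patched.bound < original.bound;
}

/* True if every invariant in a also occurs, as a string, in b. */
static bool subset(const char *const *a, size_t na,
                   const char *const *b, size_t nb)
{
    for (size_t i = 0; i < na; i++) {
        bool found = false;
        for (size_t j = 0; j < nb && !found; j++)
            found = strcmp(a[i], b[j]) == 0;
        if (!found)
            return false;
    }
    return true;
}

/* Syntactic stand-in for SEMAEQ: equal sets of invariant strings. */
bool semaeq_syntactic(const char *const *p, size_t np,
                      const char *const *pt, size_t npt)
{
    return subset(p, np, pt, npt) && subset(pt, npt, p, np);
}

/* Specification (3): unchanged functional invariants and a smaller bound. */
bool validity(const char *const *p, size_t np,
              const char *const *pt, size_t npt,
              nfv_pred patched, nfv_pred original)
{
    return semaeq_syntactic(p, np, pt, npt) && predsm(patched, original);
}
```

The syntactic check can reject patches whose invariants are equivalent but written differently, which is the case where an additional equivalence or implication query to a prover becomes necessary.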

We now describe two formal procedures to verify the validity of plausible 
patches (specification (3)) using the available program verification tools. 


1. Daikon-PVS: In this patch validation procedure, Daikon is used to generate 
predicates related to the functional and efficiency attributes of programs 
P and pt. In the event that $\mathcal{I}(P, V)$ and $\mathcal{I}(pt, V)$ (i.e., predicates related
to functional attributes) are not identical, it may be necessary to examine 
both equivalence and implication relations between the predicates in those 
sets in order to determine whether P and pt are semantically equivalent. By 
querying the theorem prover PVS, this task can be accomplished. 

2. CPAChecker-PVS: One interesting feature in CPAChecker is that it produces 
correctness witnesses in GraphML format and in those witnesses, one can 
find the invariants of the analyzed program. This feature can be utilized to 
generate the set of invariants in both the original program and corresponding 
plausible patch. In case the invariants generated for both programs are not
identical, it may be necessary to examine both equivalence and implication 
relations between the predicates in the two sets by invoking the prover PVS. 


3.2 Fixing real-world performance bugs using invariant-based APR 


In this section, we show how invariant-based APR can be used to handle real-world
performance bugs. For space reasons, we only consider one interesting example
of a performance bug (see Listing 1). The bug is based on a real-world flaw
that occurred in Apache and has also been analyzed by other researchers [14].

Analysis of the program in Listing 1: The program aims to determine 
whether a given (target) string is contained within another (source) string. If 

int found = -1;
while (found < 0) {
    // Check if string source[] contains target[]
    char first = target[0];
    int max = sourceLen - targetLen;
    for (int i = 0; i <= max; i++) {
        // Look for first character.
        if (source[i] != first) {
            while (++i <= max && source[i] != first);
        }
        // Found first character
        if (i <= max) {
            int j = i + 1;
            int end = j + targetLen - 1;
            for (int k = 1; j < end && source[j] == target[k]; j++, k++);
            if (j == end) {
                /* Found whole string target. */
                found = i;
                break;
            }
        }
    }
    // append another character; try again
    source[sourceLen++] = getchar();
}


Listing 1. A challenging performance bug found in Apache 


the target string is found in the source string, the program sets the variable 
found to the index of the target string’s first character. But there is a significant 
performance flaw in the program: when the target string is at the start of the 
source string, the run is fast, and the program stops almost instantaneously. 
On the other hand, the run is slower and takes longer to finish when the target 
string is closer to the end of the source string. This is mostly because there will 
be a significant increase in the number of redundant computations. The fault 
is that the initialization statement of the control variable i of the for loop at 
line 6 should be placed outside the scope of the main while loop just after the 
initialization of the variable found. The longest run that we reported occurs
when the source string has a length of $10^7$ characters, and the target is a single
character that is present at the end of the source string. In this instance, the
program runs for 30 hours before terminating and producing the correct results.


3.3 Results and analysis 


To handle the performance bug in Listing 1, we select two APR tools: the search-based
repair tool GenProg [7] and the semantic-based repair tool FAngelix [16].
These are general-purpose repair tools for C code that can be used to fix a
range of program bugs, including loop program bugs. While GenProg successfully 
generated a plausible patch, FAngelix was unable to produce a plausible one. To 
avoid doing repetitive calculations in the original program, GenProg moved the 
initialization statement of the variable i outside of the for loop at line 6. In other 
words, the program starts with the initialization statement of the variable i in 
the patched version. In this case, the generated patch passes the test cases since 
i is no longer being set to 0 every time the loop receives a new character. 

To check the validity of the plausible patch generated by GenProg, we run the 
tool Daikon and compare the functional and efficiency predicates obtained for 
both the original program and the plausible patch. Daikon generates the same 
set of invariants w.r.t. functional variables (i.e., both the original and the patched 
versions have the same invariants w.r.t. program variables.) This demonstrates 
that the patch maintains the functional behavior of the original program. 

Listing 1 contains four loops: the while loop at line 2, for loop at line 6, while 
loop at line 9, and for loop at line 15. To evaluate the efficiency of the original and 
patched programs, it is sufficient to calculate the upper bound on the number 
of iterations, as the patch does not modify the logic of any of the loops by 
adding or removing an operation. That is, each iteration of the four loops in 
both programs involves the same number of operations. We therefore add four 
iteration counters ($cnt_2$, $cnt_6$, $cnt_9$, $cnt_{15}$) to model the efficiency of each loop,
where the index of the counter corresponds to the line number of the loop being
analyzed. For instance, the counter $cnt_2$ is initially set to zero and advanced by
one whenever the loop at line 2 is run. We make the following observations when
analyzing the efficiency predicates for both the buggy and patched versions: 


- Invariants generated for the counter variables $cnt_2$ and $cnt_{15}$ in the buggy
and patched versions are the same. This indicates that the patch does not
affect the number of times the loops at lines 2 and 15 are iterated.

- The counter variable $cnt_9$ only advances in the buggy version and results in
the invariant $cnt_9 < 500499$. The fact that the patched version no longer
executes the while loop at line 9 is a sign of a major improvement.

- Daikon generated the invariant $cnt_6 < 1001$ in the buggy version and the
invariant $cnt_6 < 501$ in the patched version. This shows that the loop at line 6 is
iterated about 50% fewer times in the patched version than in the original code.


The aforementioned findings, along with the fact that the derived functional
predicates of both the original and patched versions are identical, boost our
confidence in the validity of the patch generated by GenProg.


4 Related Work 


Patch overfitting in APR: Several solutions have been developed to allevi- 
ate the overfitting problem in APR, such as symbolic specification inference [8], 
machine learning-based prioritization of patches [1], fuzzing-based test-suite aug- 
mentation [5], and concolic path exploration [12]. These solutions rely on limited 
incomplete test cases and do not guarantee the general correctness of the patches.
Compared to those approaches that generate test inputs, invariant-based APR 
automatically generates and refines desired invariants that need to be main- 
tained and violated invariants that need to be repaired when modifying code, 
which makes the approach more reliable than existing repair approaches. 

Modern general-purpose APR tools still rely on symbolic execution or concolic
execution [9, 12] to discover counterexamples and generate repairs. However,
these repair approaches rely on manual inspection to determine whether the generated
patches are correct or identical to developer patches, which could be error-prone.
Invariant-based APR makes it possible to apply automated verification techniques
to alleviate the overfitting problem and to check the accuracy of generated
patches formally and systematically by comparing them to the developers’ patches.
Handling performance bugs: Several attempts have been made to detect and 
repair performance bugs in programs using dynamic, static, and hybrid analysis 
approaches [13,6,10]. [10] carried out an empirical investigation into perfor- 
mance bugs and presented several efficiency rules for identifying them. Using 
dynamic-static analysis techniques, several fix strategies have been developed 
in [13] to identify and fix performance problems. However, our method differs
from previous studies in that it is a more general and rigorous technique
that makes use of program invariants to address performance issues in loop programs
and yield reliable patches. Thanks to program invariants, the original program’s
efficiency can be systematically compared to that of the patched version.


5 Conclusion and Future Work 


We described a novel general-purpose APR system based on the concept of pro- 
gram invariants. Invariant-based APR holds the promise to handle a wider range 
of bugs and produce more reliable patches than other APR approaches. This is 
because invariant-based repair systems depend on stronger correctness criteria 
rather than test suites. We demonstrate the usefulness of leveraging invariants in 
APR by developing an invariant repair system for performance defects. The pre- 
liminary results showed that invariant-based APR can assist in generating valid 
patches that ensure efficiency improvement without compromising functionality. 
Future work: To complete the line of research initiated here regarding invariant- 
based APR, we identify the following key directions for future work. 


— First and foremost, we aim to conduct a thorough empirical analysis to deter- 
mine how well invariant-based APR handles functional and non-functional 
defects in programs. This also entails assessing the invariant inference and 
invariant verification tools that are currently accessible. 

— Accurate invariant generation is required to ensure the validity of patches 
produced by invariant-based APR. We conjecture that reachability analy- 
ses can aid with this complex computational task and we aim to combine 
invariant-based APR with program verification tools that support both in- 
variant generation and refinement such as CPAChecker and PathFinder. 


Abstract. Large language models have become increasingly effective
in software engineering tasks such as code generation, debugging and 
repair. Language models like ChatGPT can not only generate code, but 
also explain its inner workings and in particular its correctness. This 
raises the question whether we can utilize ChatGPT to support formal 
software verification. 

In this paper, we take some first steps towards answering this question. 
More specifically, we investigate whether ChatGPT can generate loop 
invariants. Loop invariant generation is a core task in software verifica- 
tion, and the generation of valid and useful invariants would likely help 
formal verifiers. To provide some first evidence on this hypothesis, we ask 
ChatGPT to annotate 106 C programs with loop invariants. We check 
validity and usefulness of the generated invariants by passing them to two 
verifiers, FRAMA-C and CPAchecker. Our evaluation shows that Chat- 
GPT is able to produce valid and useful invariants allowing FRAMA-C to 
verify tasks that it could not solve before. Based on our initial insights, 
we propose ways of combining ChatGPT (or large language models in 
general) and software verifiers, and discuss current limitations and open 
issues. 


Keywords: Large language models - Invariant generation - Formal ver- 
ification. 


1 Introduction 


Large language models (LLMs) [B780] are increasingly employed to support 
software engineers in the generation, testing and repair of code [I15]1427]. Gen- 
erative AI can, however, not only generate code, but also provide explanations 
of the inner workings of code and give arguments about its correctness. This 
raises the question whether LLMs can also support formal software verification. 

In this paper, we provide a first step towards answering this question. In gen- 
eral, one can imagine various ways of supporting verifiers, depending on the ver- 
ification approach they employ. Central to all verifiers are, however, techniques 
for dealing with loops. Specifically, for abstracting the behaviour of loops, veri- 
fiers aim at computing loop invariants. Our first step in evaluating ChatGPT’s 
usefulness for software verification is thus the generation of loop invariants. 

To this end, we ask ChatGPT to annotate C-programs with loop invariants. 
We have chosen 106 C-programs from the Loops category of the annual com- 
petition on software verification [7]. To enable the usage of these invariants by 
© The Author(s) 2024 

Prompt> Compute a loop invariant for the following program!

void func(unsigned int n)
{
    unsigned int x=n, y=0;
    //@ loop invariant [mask];
    while(x>0) {
        x--; y++;
    }
    assert(y==n);
}

Infilling provided by ChatGPT: x+y==n
Fig. 1. Example task: loops/count_up_down-1.


verifiers, we needed the invariants to be given in some formal language. For 
this, we have chosen ANSI/ISO C Specification Language (ACSL) [5], a design- 
by-contract like annotation language for C. Initial experiments confirmed that 
ChatGPT “knows” ACSL. The main part of our experiments then concerned 
the evaluation of the invariants with respect to (a) validity and (b) usefulness 
for verifiers. The first aspect required checking whether a proposed invariant 
is actually a proper invariant, i.e., whether the computed predicate holds at 
the beginning of the loop and after every loop iteration. We employ the state-of-the-art
interactive verifier FRAMA-C [4] for this validity checking. For evaluating
the usefulness of invariants, we provided two state-of-the-art verifiers (FRAMA-C
SV [9] and CPAchecker [8]) with the code annotated by the proposed invariant,
and evaluated whether the verifiers can then solve verification tasks which they
could not solve without the invariant.¹ Our results confirm that ChatGPT can
support software verifiers by providing valid and useful loop invariants, but also 
show that more work needs to be done — both conceptually and practically — to 
have LLMs provide a significant support for software verification. 


2 Invariant Generation with ChatGPT 


Our goal is to provide initial insights into the capabilities of large language 
models, specifically ChatGPT, to support formal software verification. For this, 
we propose the task of loop invariant generation. 


Loop invariant generation. The goal of loop invariant generation is to gener- 
ate valid and useful loop invariants for a given program. A valid loop invariant 
is an invariant that (1) holds true before the first loop execution and (2) after 
each loop iteration. A useful loop invariant is a valid loop invariant that is useful 
for proving the given program correct. 

To understand this, let us consider the example task shown in Figure 1. Here,
the large language model is tasked to analyze the given program and to propose
a loop invariant. For the given program, the invariant x + y == n represents a
valid loop invariant: as x is initialized to n and y to 0, the invariant holds (1)


1 In the case of CPAchecker, we restrict CPAchecker’s own invariant generation facilities
so as to be able to see the plain effect of the generated invariant.


268 C. Janen et al. 


before the first loop execution. The invariant furthermore holds (2) after each 
loop iteration as y is incremented each time x is decremented. 

The provided loop invariant also is a useful loop invariant: As x == 0 at the 
end of the loop execution and x + y == n holds after the loop execution, we 
can deduce that the assertion y == n is not violated after the loop execution. 
The invariants x <= n and y >= 0 also represent valid loop invariants but they 
are not useful for proving the program correct. 
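
The two validity conditions and the usefulness argument can be made concrete with a small C sketch of the program from Figure 1. The function name, the return value, and the runtime assert calls are additions made for illustration; the asserts only check the invariant on concrete runs and do not replace a proof by a verifier.

```c
#include <assert.h>

/* Program of Figure 1 with [mask] replaced by x + y == n.
   The name count_up_down and the return value are illustrative. */
unsigned int count_up_down(unsigned int n)
{
    unsigned int x = n, y = 0;
    assert(x + y == n);        /* (1) holds before the first iteration */
    /*@ loop invariant x + y == n; */
    while (x > 0) {
        x--; y++;
        assert(x + y == n);    /* (2) holds after each iteration */
    }
    assert(y == n);            /* follows from x == 0 and x + y == n */
    return y;
}
```

Replacing the annotation by x <= n would keep both runtime checks of the invariant valid, but the final assertion could no longer be derived from the invariant and the exit condition alone.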


The idea is now to let ChatGPT generate such loop invariants. To this end, we 
need to tell ChatGPT what its task is. As briefly mentioned in the introduction, 
we expect ChatGPT to give loop invariants in the form of ACSL (ANSI/ISO C Specification
Language [5]) assertions. ACSL is a specification language for C and offers
a number of keywords for specifications in a design-by-contract style. Among
others, there is the keyword loop invariant. ACSL specifications are written
inside comments of the form //@. Besides the plain code, Figure 1 also shows the
prompt used to tell ChatGPT its task (first line), and the code location and form
of the invariant we expect to be generated (//@ loop invariant [mask];).² We
thus phrase the task as an infilling problem [21], i.e., we require ChatGPT to
fill in some meaningful contents for [mask]. In this example, ChatGPT returns
the above discussed invariant. We arrived at this form of stating the task after 
several experiments with different prompts. 


Feeding loop invariants into verifiers. For evaluation of the generated in- 
variants, we need to determine their validity and usefulness. To this end, we first 
of all need to feed them into some verifier. Interactive verifiers natively provide 
ways of feeding in such inputs. In an interactive verification run, a software en- 
gineer provides program annotations (e.g., invariants) and the verifier tries to 
prove that some given specifications are never violated.³

In this work, our goal is to evaluate the ability of large language models 
to support verifiers. Therefore, we replace the software engineer by ChatGPT 
and let it interact with the interactive verifier. Currently, the language model 
only interacts by exchanging loop invariants (which is in line with our evaluation
goal). However, in future work it could be interesting to let the language model 
generate other types of annotations. 

During our evaluation, we use the interactive verifier FRAMA-C [4] to eval- 
uate the validity and usefulness of the provided invariants. For evaluating the 
usefulness, we furthermore employ an automatic verifier (CPAchecker [8]). To 
also allow for interaction in this case, we employ ACSL2Witness [10] to convert 
the ACSL annotated program to a correctness witness which CPAchecker is then 
able to use in its verification. 


Related work. There are only a few works that address invariant generation via 
machine learning. The work in [32] uses large language models to predict invari- 
ants of Java programs. They specifically trained large language models to predict 


2 Prompt and answer from ChatGPT are abbreviated to fit the figure; the full prompt 
is given in the appendix. 

3 There exists a variety of properties that can be checked via verification; we focus 
here on checking for violations of assertions. 
Daikon [20] generated invariants. Their evaluation does not consider validity or
usefulness of the generated invariants but only concerns whether Daikon invari- 
ants can be recovered. In contrast, in this work, we rely on instruction-tuned 
large language models such as ChatGPT without any training and we use formal 
verification approaches to evaluate the validity and usefulness of loop invariants 
generated for C code. 

Many approaches [36, 31, 22, 35, 12], which are related to or based on Syntax-Guided
Synthesis, have addressed invariant generation via machine learning techniques.
However, most of the existing techniques rely on traditional machine
learning or graph neural network based techniques instead of large language 
models. We are interested in the capabilities of large language models in sup- 
porting C software verifiers. 

Beyond invariants, there also exist other ways to support software verifiers. 
For example, the work in [3, 23] supports verifiers with neural-network based
termination analyses. However, these approaches are often deeply integrated. 
We chose loop invariant generation as many software verifiers already support 
the exchange of invariants. 


3 Evaluation 


We evaluate ChatGPT on the task of loop invariant generation in C code. For 
the evaluation, we use a benchmark of 106 verification tasks taken from the 
SV-COMP Loops category [7]. We have chosen all tasks which (a) have ACSL 
annotations (to be able to compare the generated with manually constructed 
invariants), (b) have one loop only and (c) are correct, i.e., the assertions in the 
code are valid. During our evaluation, we remove all ACSL invariant annotations 
and let ChatGPT regenerate them. Now, based on our evaluation setup we aim 
to answer the following research question: 


Can ChatGPT support software verifiers with valid and useful loop invariants? 


Experimental setup. For generating loop invariants, we employ the ChatGPT
(GPT-3.5) snapshot from June 2023. The model is queried via the OpenAI API.⁴
During our evaluation, we set the sampling temperature⁵ of ChatGPT to 0.2 and
sample up to k (k = 5) completions per task. We collect all invariants by parsing
the generated completions with the infillings.
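
How an infilling might be extracted from a completion can be sketched as follows. The paper does not describe its parser, so the assumed completion format (a line containing "loop invariant", followed by the expression and a terminating semicolon) and the function name are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the expression between "loop invariant" and the next ';' into
   `out`. Returns 1 on success, 0 if no well-formed annotation is found
   or the expression does not fit into `cap` bytes. */
int extract_invariant(const char *completion, char *out, size_t cap)
{
    static const char key[] = "loop invariant";
    const char *p = strstr(completion, key);
    if (p == NULL)
        return 0;
    p += sizeof key - 1;               /* skip the keyword itself */
    while (*p == ' ' || *p == '\t')
        p++;
    const char *end = strchr(p, ';');
    if (end == NULL || end == p)
        return 0;
    size_t len = (size_t)(end - p);
    if (len >= cap)
        return 0;
    memcpy(out, p, len);
    out[len] = '\0';
    return 1;
}
```

Completions that contain no annotation in this form are simply skipped by such a parser, which is one reason a processable output format matters when sampling several completions per task.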

For checking the validity of the generated invariants, we use the interactive
verifier FRAMA-C [4]. We annotate each task with one of the n generated
invariants. In total, we thus generate up to n annotated versions of each task,
which we use for validation. We count loop invariants as validated only if
FRAMA-C WP can validate them within 10 s.⁶


4 https://platform.openai.com/, accessed in Sept. 2023

5 The temperature controls the randomness of ChatGPT’s outputs; a lower temper- 
ature leads to more deterministic outputs. We have chosen a low temperature to 
obtain invariants in a processable format. 

6 Note that a negative answer of FRAMA-C does not necessarily mean that the can- 
didate invariant is invalid. 
Table 1. Results for 106 verification tasks, divided by subcategory of the Loops
category (giving total number of tasks, number of successfully validated
invariants, number of verified tasks per verifier using either the generated or
the human provided invariant of the benchmark, and in parentheses, gray in the
original, the number of useful invariants)


                        Tasks                 FRAMA-C                  k-induction
Subcategory          total  val-invs.   GPT invs.  Human invs.   GPT invs.  Human invs.
loop-accelaration      15       8        1 (1)       2 (2)        6 (3)       6 (3)
loop-crafted            2       2        0 (0)       0 (0)        2 (0)       2 (0)
loop-industry-pat.      1       1        0 (0)       0 (0)        1 (0)       1 (0)
loop-invariants         8       4        3 (3)       3 (3)        0 (0)       1 (1)
loop-invgen             3       3        0 (0)       0 (0)        0 (0)       0 (0)
loop-lit               13       4        1 (1)       4 (4)        3 (2)       4 (3)
loop-new                7       4        1 (1)       1 (1)        0 (0)       0 (0)
loop-simple             1       1        1 (1)       1 (1)        1 (0)       1 (0)
loop-zilu              22      18       10 (10)     11 (11)      11 (6)      10 (5)
loops                  13      13        5 (5)       6 (6)        8 (1)       8 (1)
loops-crafted-1        21      17        0 (0)       0 (0)        4 (3)       7 (6)
total                 106      75       22 (22)     28 (28)      36 (15)     40 (19)


For evaluating the usefulness of the generated invariants, we now annotate
the task with the validated invariants from the previous step. If multiple
invariants are validated per task, we conjoin them into a single invariant and
annotate the task with the conjoined invariant.⁷ As verifiers, we consider the
interactive verifier FRAMA-C SV⁸ and the automatic verifier CPAchecker [8]. We
configure CPAchecker to run k-induction without loop unrolling (similar to [10])
to be able to see the effect of the generated invariant. Note that this restricts
CPAchecker's facilities for verification. Finally, all verifier and validation runs
are executed via BenchExec [6] on a 24-core machine with 128 GB RAM running
Ubuntu 22.04 with a maximum time limit of 900 s.
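
The conjunction step described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function name conjoin_invariants, the parenthesization of each conjunct and the fixed-size output buffer are hypothetical choices and are not taken from the evaluated tooling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: joins n validated invariants into a single
 * expression "(a) && (b) && ...". Each conjunct is parenthesized so that
 * operator precedence inside a conjunct cannot change its meaning.
 * Returns 0 on success and -1 if n is 0 or the buffer is too small. */
int conjoin_invariants(const char *const invs[], size_t n,
                       char *out, size_t out_size) {
    assert(invs != NULL && out != NULL);
    if (n == 0 || out_size == 0) return -1;

    size_t used = 0;
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        int written = snprintf(out + used, out_size - used, "%s(%s)",
                               i == 0 ? "" : " && ", invs[i]);
        /* snprintf reports the length it needed; reject truncation. */
        if (written < 0 || (size_t)written >= out_size - used) return -1;
        used += (size_t)written;
    }
    return 0;
}
```

The result can then be placed after "//@ loop invariant" in the task. Soundness of the step rests only on the fact that a conjunction of valid invariants is valid.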


Results. Our main results are shown in Table 1. On the left side of the table,
we show the total number of tasks per subcategory (total) and the number of 
tasks where at least one of the generated invariants can be validated (val-invs.). 
On the right side of the table, we report on the verification results obtained from 
executing FRAMA-C and CPAchecker (using k-induction without loop unfolding) 
on the verification tasks with at least one validated invariant. We report the 
total number of tasks that can be verified with a ChatGPT provided invariant 
(GPT invs.) and a human provided invariant (Human invs.), i.e., the ACSL 
invariant given in the benchmark. In addition, we also report the number of 
useful invariants in gray brackets. Useful here means that the verifier cannot 
complete the verification task without the invariant. 


7 The logical conjunction of two valid invariants is again a valid invariant. 
8 Frama-C SV is a version of FRAMA-C specifically configured to work well on
SV-COMP tasks.
void func() {
  unsigned int x = 0, y = 1;
  //@ loop invariant [mask];
  while (x < 6) { x++; y *= 2; }
  assert(y % 3 != 0);
}

Infilling provided by ChatGPT: x <= 6 && y == pow(2, x)
Human: (x==0 && y==1) || (x==1 && y==2) || (x==2 && y==4) ||


Fig. 2. Example task: loop-accelaration/underapprox_1-2


ChatGPT can generate valid loop invariants. We find that ChatGPT can gen- 
erate valid loop invariants for 75 out of 106 tasks (as validated by FRAMA-C). 
Note that ChatGPT proposes loop invariant candidates for all 106 tasks and by 
manual inspection we found that some of the generated loop invariant candi- 
dates are still meaningful, even though they are not validated by FRAMA-C. An 
example is shown in Figure 2. ChatGPT produces a meaningful loop invariant
candidate, but FRAMA-C rejects the candidate due to technical reasons.⁹ The
human-annotated invariant avoids this problem by enumerating all variable as- 
signments. In total, we found by manual inspection that 10 out of 31 invariant 
candidates not validated by FRAMA-C are meaningful. 
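
To see why the candidate of Figure 2 is meaningful, the loop can be instrumented so that the invariant is checked at run time. The sketch below is an assumption-laden illustration, not part of the evaluation: pow(2, x) is replaced by the integer shift 1u << x to stay within integer arithmetic, and the names invariant and run_fig2_loop are hypothetical. A run-time check on one execution is of course not a proof.

```c
#include <assert.h>
#include <stdbool.h>

/* Candidate invariant of Fig. 2, with pow(2, x) written as a shift. */
static bool invariant(unsigned int x, unsigned int y) {
    return x <= 6 && y == (1u << x);
}

/* Executes the loop of Fig. 2 and checks the invariant on loop entry
 * and after every iteration. Returns the final value of y. */
unsigned int run_fig2_loop(void) {
    unsigned int x = 0, y = 1;
    assert(invariant(x, y));       /* holds before the first iteration */
    while (x < 6) {
        x++;
        y *= 2;
        assert(invariant(x, y));   /* still holds after the loop body */
    }
    assert(y % 3 != 0);            /* the property of the task */
    return y;
}
```

At loop exit the invariant and the negated guard give x == 6 and hence y == 64, which is not divisible by 3.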


Interestingly, we found during our manual inspection that ChatGPT in many 
cases seems to apply a set of useful heuristics to determine loop invariant
candidates. One of the most successful heuristics applied by ChatGPT on our
benchmark is the copy assertion heuristic. Here, ChatGPT proposes an invariant that
is equivalent to a condition found in a nearby assertion. The heuristic is applied 
in 30 out of 106 tasks and 23 of the resulting invariants are validated. 


ChatGPT can support verifiers with useful loop invariants. We find that Chat- 
GPT can produce useful invariants that can support software verifiers in their 
verification tasks. In comparison to the human-provided invariants, ChatGPT 
produced useful invariants for 22 out of 28 tasks in the case of FRAMA-C and 
for 15 out of 19 tasks in the case of CPAchecker’s k-induction. Interestingly, we 
find one example in the loop-zilu subcategory where the invariant proposed by 
ChatGPT is more useful for CPAchecker than the human annotated invariant. 
The example is shown in Figure 3. Here, ChatGPT proposes the invariants
j >= 0 and k >= 0 conjoined with the human-provided invariant, which is
obviously useful to prove that k >= 0 holds at the end of the loop. Note
that, while this seems to be a case where the copy assertion heuristic is
effective, FRAMA-C does not validate the invariant candidate k >= 0 alone. The
conjunction with j<=n && k>=n-j is important to validate the invariant. Still, 
by manual inspection we find that the copy assertion heuristic of ChatGPT is 
effective for providing useful invariants in 11 out of 22 cases for FRAMA-C and 
in 5 out of 15 cases for k-induction. 
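
One plausible reading of why k >= 0 alone is not validated for the task of Figure 3 is that it is not preserved by the loop body, whereas the full conjunction is. The following sketch checks the preservation step exhaustively on a small range of states. It is a bounded experiment under assumed names (weak, strong, preserved_on_small_states), not a proof and not the check that FRAMA-C performs.

```c
#include <assert.h>
#include <stdbool.h>

typedef bool (*invariant_fn)(int k, int j, int n);

/* Candidate obtained by copying the assertion only. */
static bool weak(int k, int j, int n) {
    (void)j; (void)n;
    return k >= 0;
}

/* Full infilling reported for Fig. 3. */
static bool strong(int k, int j, int n) {
    return j >= 0 && k >= 0 && j <= n && k >= n - j;
}

/* Bounded preservation check: for every state with all variables in
 * [-bound, bound] that satisfies inv and the loop guard j <= n - 1,
 * the state after "j++; k--;" must satisfy inv again. */
bool preserved_on_small_states(invariant_fn inv, int bound) {
    for (int k = -bound; k <= bound; k++)
        for (int j = -bound; j <= bound; j++)
            for (int n = -bound; n <= bound; n++)
                if (inv(k, j, n) && j <= n - 1 && !inv(k - 1, j + 1, n))
                    return false;
    return true;
}
```

For instance, the state k = 0, j = 0, n = 1 satisfies k >= 0 and the loop guard, but one iteration yields k = -1. The conjunct k >= n - j excludes this state.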


9 FRAMA-C reports an invalid conversion from integer type to a floating point type
due to the pow operator and thereby fails.


void func(int k, int j, int n) {
  if (!(n>=1 && k>=n && j==0)) return;
  //@ loop invariant [mask];
  while (j<=n-1) { j++; k--; }
  assert(k>=0);
}

Infilling provided by ChatGPT: j >= 0 && k >= 0 && j <= n && k >= n - j
Human: j <= n <= k + j


Fig. 3. Example task: loop-zilu/benchmark04_conjunctive.


4 Limitations and Open Issues 


We discuss limitations and open issues in using large language models for sup- 
porting software verifiers. 


Cooperation between Language Model and Software Verifier. Our eval- 
uation has shown that large language models such as ChatGPT are already ca- 
pable of producing valid and useful loop invariants for our benchmark tasks. 
However, to be useful in practice, there are several challenges we have to master. 
A key challenge is the communication and cooperation between large language 
model and software verifier. Currently, we have implemented a top-down ap- 
proach for invariant generation, i.e., we start by querying the language model 
for invariant candidates, validate them and then provide them to a verifier. 
The LLM has no knowledge about the specifics of the underlying validator or 
the verifier used in the process. This can ultimately hinder the large language 
model from generating valid (as validated by the validator) or useful (as deter- 
mined by the verifier) loop invariants. During our evaluation, we already have 
encountered an example where this knowledge gap leads to meaningful but not 
validated invariant candidates (see Figure 2). Here, the language model has no
knowledge about the specifics of the validator used (FRAMA-C) or at least is not 
informed that the proposed expression leads to a parsing error. Communicating 
this information allows the large language model to self-debug [17] its invariant 
proposals and thereby propose invariant candidates that are validated by the 
validator and that are useful for the verifier. For example, if we report the im- 
plicit conversion error back to ChatGPT, it generates a new invariant candidate 
(y == 1 << x) for our example in Figure 2 that is validated by our validator.
Overall, we envision a cooperative approach between large language model,
invariant validator and software verifier as shown in Figure 4. In an inner
loop, the large language model cooperates with the validator to identify
valid loop invariants. Here, the language model proposes invariant candidates,
obtains feedback from the validator and refines its invariant suggestion. In
the outer loop, the language model cooperates in the same way with the
software verifier to find useful loop invariants. This work already implements (a)
the validation of invariant candidates and (b) the verification with useful
invariants. The key challenge is now to determine which feedback is needed from (c)
the validator or (d) the software verifier to effectively guide the language model
to valid and useful invariants.

[Figure 4: diagram connecting the LLM, the invariant validator and the software
verifier, with edges labeled (a) valid?, (b) useful?, (c) not valid! and
(d) not useful!]

Fig. 4. Conceptual overview.
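
The envisioned cooperation can be sketched as a simple control loop. Everything in the block below is hypothetical: the callback interface cooperation_hooks, the round limit and the three toy_* stand-ins are assumptions made for illustration, and the toy validator merely imitates the pow conversion error of Figure 2 instead of calling a real tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical interface between language model, validator and verifier. */
typedef struct {
    /* Writes a candidate into out; feedback is NULL in the first round. */
    void (*propose)(const char *feedback, char *out, size_t out_size);
    /* (a)/(c): true if valid, otherwise fills fb with an error message. */
    bool (*validate)(const char *candidate, char *fb, size_t fb_size);
    /* (b)/(d): true if useful, otherwise fills fb with a reason. */
    bool (*verify)(const char *invariant, char *fb, size_t fb_size);
} cooperation_hooks;

/* Queries the model until a candidate is both valid and useful or the
 * round limit is reached. On success, out holds the invariant. */
bool find_useful_invariant(const cooperation_hooks *h, int max_rounds,
                           char *out, size_t out_size) {
    assert(h != NULL && out != NULL);
    char feedback[256];
    const char *last = NULL;
    for (int round = 0; round < max_rounds; round++) {
        h->propose(last, out, out_size);
        if (!h->validate(out, feedback, sizeof feedback)) {
            last = feedback;               /* (c) not valid: refine */
            continue;
        }
        if (h->verify(out, feedback, sizeof feedback))
            return true;                   /* valid and useful */
        last = feedback;                   /* (d) not useful: refine */
    }
    return false;
}

/* Toy stand-ins: the "model" switches to a shift once it sees feedback. */
static void toy_propose(const char *feedback, char *out, size_t n) {
    snprintf(out, n, "%s", feedback == NULL ? "x <= 6 && y == pow(2, x)"
                                            : "x <= 6 && y == 1 << x");
}
static bool toy_validate(const char *cand, char *fb, size_t n) {
    if (strstr(cand, "pow(") != NULL) {
        snprintf(fb, n, "invalid conversion from integer to floating point");
        return false;
    }
    return true;
}
static bool toy_verify(const char *inv, char *fb, size_t n) {
    (void)inv; (void)fb; (void)n;
    return true;                           /* always reports "useful" */
}
```

With the toy hooks, the first candidate is rejected by the validator and the second one is accepted. Which feedback a real validator or verifier should return is exactly the open question raised above.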

A subsequent study [28] provides first insights into the feasibility of our
approach. By providing feedback to the language model (in the form of error
messages produced by Frama-C), the authors showed that language models can
effectively repair their invariant proposals. We believe that providing more
detailed feedback (e.g., a more detailed explanation of why the validation
process fails) can further boost the performance of language model based
invariant generation.

Finally, we can envision that our approach to language model and verifier 
cooperation may be useful beyond invariant generation. For example, TriCo [2] 
proposes to check the conformity between implementation and code specification 
with a verifier. A large language model could react to conformity violations and 
repair either the implementation or the specification. 


Unified assertion language. Our approach for invariant generation requires 
that large language models, validators and software verifiers communicate
invariants with a common specification language (e.g., ACSL in our case).
However, in practice, there exists a zoo of interactive verifiers such as
DAFNY [29], FRAMA-C [4], KEY [1], KIV [19], and VERIFAST and automated software
verifiers such as CBMC [18], CPAchecker [8], Symbiotic [13], and Ultimate
Automizer [24]. All of them implement their own custom way to communicate
invariants. Therefore, we either have to find a way to unify the communication 
of invariants between systems or we have to define transformations that convert 
between communication formats. In this work, we have already employed the 
transformation ACSL2WITNESS to convert ACSL to a format understand- 
able by automated software verifiers. In the future, we plan to explore alternative 
transformations to support a wider range of validators and verifiers. 


Known limitations of LLMs. Large language models have many known lim- 
itations such as hallucinations [26], input length limitations [80], and limited 
reasoning capabilities [34]. All of this can significantly limit the ability of large 
language models to produce valid and useful loop invariants or to support soft- 
ware verifiers in general. However, active research is underway to overcome these 
limitations, and a number of proposals have already been made to reduce halluci- 
nations [33], increase input length [16], or improve the reasoning performance [38] 
of large language models. It would be interesting for future work to evaluate how 
these solutions impact the loop invariant generation abilities of large language 
models. 


5 Conclusion 


In this work, we provided a first step towards answering the question whether 
large language models can support formal software verification. For this, we 
have evaluated ChatGPT on the task of loop invariant generation. Our
evaluation shows that ChatGPT can support software verifiers by providing valid
and useful loop invariants. We plan to further improve the support for software 
verification in future work by a cooperative approach that enables exchange of 
information between large language models, invariant validators and software 
verifiers. In particular, we intend to develop methods for providing feedback to 
LLMs whenever candidate invariants are found to not be valid. 


A Prompting ChatGPT 


The full prompt and the answer of ChatGPT for the example task
loops/count_up_down-1 are shown in Figure 5. We use the same prompt for all tasks.
The answer of ChatGPT can vary slightly between executions. Therefore, we
generate up to k answers and collect invariants from all answers via a regular
expression.
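
The extraction step can be sketched with POSIX regular expressions. The pattern below is an assumption: it simply follows the answer format requested in the prompt (//@ loop invariant X ;), and the function name extract_invariant is hypothetical. The regular expression actually used in the evaluation is not given here.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Copies X from the first "//@ loop invariant X ;" in answer into out,
 * without trailing blanks. Returns 0 on success, -1 if no annotation is
 * found or out is too small. */
int extract_invariant(const char *answer, char *out, size_t out_size) {
    assert(answer != NULL && out != NULL);
    regex_t re;
    regmatch_t m[2];

    /* POSIX extended syntax; group 1 captures everything up to the
     * first ';' after the keyword. */
    if (regcomp(&re, "//@[ \t]*loop[ \t]+invariant[ \t]+([^;]*);",
                REG_EXTENDED) != 0)
        return -1;
    int rc = regexec(&re, answer, 2, m, 0);
    regfree(&re);
    if (rc != 0) return -1;

    const char *start = answer + m[1].rm_so;
    size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
    while (len > 0 && (start[len - 1] == ' ' || start[len - 1] == '\t'))
        len--;                          /* drop blanks before the ';' */
    if (len + 1 > out_size) return -1;
    memcpy(out, start, len);
    out[len] = '\0';
    return 0;
}
```

Applying the function to each of the up to k answers yields the candidate set that is passed on to validation.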


Prompt> Compute a loop invariant for the annotated loop including [invariant] in the
following C code. Please use the format of ACSL annotations and always end your
response with //@ loop invariant X ; where X is the computed invariant.

[...]

unsigned int __VERIFIER_nondet_uint();

int main()
{
  unsigned int n = __VERIFIER_nondet_uint();
  unsigned int x=n, y=0;
  //@ loop invariant [invariant];
  while (x>0)
  {
    x--;
    y++;
  }
  __VERIFIER_assert(y==n);
}

ChatGPT>
The loop invariant for the given code can be stated as follows:

    //@ loop invariant y + x == n ;

[...]


Fig. 5. Full prompt and answer for example task: loops/count_up_down-1.
Abstract. Deductive verification tools can prove a large range of program
properties, but often face issues on recursive data structures. Abstract
interpretation tools based on separation logic and shape analysis can
efficiently reason about such structures but cannot deal with such large
classes of properties. This short paper presents ongoing work on combining
both techniques. We show how a deductive verifier for C programs,
FRAMA-C/WP, can benefit from a shape analysis tool, MEMCAD, where structural
and separation properties proved in the latter become assumptions for the
former. A case study on selected functions of the tpm2-tss library using
linked lists confirms the interest of the approach.


Keywords: deductive verification, shape analysis, abstract interpretation, linked 
lists, Frama-C, MemCAD 


1 Introduction 


Context and Motivation. Deductive verification tools were successfully used in
many case studies [4] to prove a large range of safety, security and functional
properties. Such tools often struggle to conduct automatic proofs on code with
recursive data structures (e.g. linked lists, trees, etc.), in particular due to
the complex memory models they need. The user has to guide the proof with
interactively proved lemmas, assertions, etc. Abstract interpretation tools based
on separation logic and shape analysis [3] can efficiently reason about such
structures but typically cannot deal with such large classes of properties. This
short paper presents new ideas and emerging results on combining both techniques,
trying to take the best of both worlds.


Approach and Results. We present a verification approach combining a popular
deductive verifier for C programs, FRAMA-C/WP [6], with a shape analysis tool,
MEMCAD [10]. The main idea is to prove structural and separation properties
in MEMCAD and then to assume them in FRAMA-C/WP in order to increase
the level of automation of the latter and overcome some of its limitations. We
apply it to a real-life case study using linked lists: a few (slightly simplified)
functions of tpm2-tss³, a popular library for communication with a Trusted
Platform Module (TPM). Recent work [11] demonstrated that deductive
verification of the library functions manipulating linked lists was relatively
hard, and required many additional lemmas and assertions.

© The Author(s) 2024

The contributions of this paper include the presentation of a combined verifi- 
cation technique using deductive verification and shape analysis, its illustration 
with FRAMA-C/WP and MEMCAD on a function manipulating linked lists, as 
well as a successful case study on a set of functions of the tpm2-tss library. 


2 Background 


2.1 Deductive Verification with Frama-C/Wp 


FRAMA-C [6] is an integrated toolbox built around a kernel offering core ser- 
vices and plugins dedicated to specific analysis or verification tasks for C code, 
e.g. value analysis, runtime assertion checking and deductive verification. ACSL 
(ANSI/ISO C Specification Language) [6] is the common specification language of the
plugins. The WP plugin performs modular deductive verification: each function 
is verified independently. It generates verification conditions (VCs) from the C 
code with ACSL annotations and requests their proof by the QED simplifier or 
by external provers. 

We illustrate the main ACSL features on the running example⁴ of Figs. 1, 3, 4
and 5, presented as we go, where ACSL notation (e.g. \forall, integer, ==>, <=, &&)
is pretty-printed (resp., as ∀, ℤ, ⇒, ≤, ∧). Lines 69-85 of Fig. 4 show a contract for
function list_push (detailed below) that adds a new value into a linked list (cf. 
Lines 1-2 of Fig. 1), allocating a new cell. The contract includes pre-conditions 
(requires clauses) and post-conditions (ensures clauses). The assigns clause is a 
special kind of post-condition that indicates the memory locations the function is 
allowed to modify. ACSL formulas are mostly multi-sorted first-order logic where 
types are either C types or logic types (such as Z, the type of mathematical 
integers). ACSL provides built-in constructs such as \result (the value returned 
by the function) and predicates such as \valid(p) (stating that pointer p refers 
to an allocated memory location, so that *p can be safely read and written) and 
\separated(p1,p2,...) (stating that the memory locations referred to by given 
pointers do not intersect). Notice that the considered memory locations are here 
indicated by pointers. Users can define predicates such as those in Fig. 1, adapted 
here from a previous work [1] on verifying linked lists in WP. 

The main predicate is the inductively defined predicate linked_ll (Lines 10-19)
stating that a linked list (segment) of int values (defined on Lines 1-2)
from pointer bl to pointer el (excluded) is a well-formed list represented by
an ACSL logical list ll. In other words, ll contains the pointers to the cells
of that list segment (or the whole list if el is NULL). ACSL lists are similar to


3 https://github.com/tpm2-software/tpm2-tss
4 Available in a companion artifact on http://doi.org/10.5281/zenodo.10497923 
 1 typedef struct cell_s {struct cell_s* next; int data;} cell;
 2 typedef cell* list;
 3 /*@
 4 predicate ptr_sep_from_list{L}(cell* c, \list<cell*> ll) =
 5   ∀ ℤ n; 0 ≤ n < \length(ll) ⇒ \separated(c, \nth(ll, n));
 6 predicate dptr_sep_from_list{L}(cell** c, \list<cell*> ll) =
 7   ∀ ℤ n; 0 ≤ n < \length(ll) ⇒ \separated(c, \nth(ll, n));
 8 predicate in_list{L}(cell* c, \list<cell*> ll) =
 9   ∃ ℤ n; 0 ≤ n < \length(ll) ∧ \nth(ll, n) == c;
10 inductive linked_ll{L}(cell *bl, cell *el, \list<cell*> ll) {
11   case linked_ll_nil{L}:
12     ∀ cell *el; linked_ll{L}(el, el, \Nil);
13   case linked_ll_cons{L}:
14     ∀ cell *bl, *el, \list<cell*> tail;
15       (\separated(bl, el) ∧ \valid(bl) ∧
16        linked_ll{L}(bl->next, el, tail) ∧
17        ptr_sep_from_list(bl, tail)) ⇒
18       linked_ll{L}(bl, el, \Cons(bl, tail));
19 }
20 predicate unchanged_ll{L1, L2}(\list<cell*> ll) =
21   ∀ ℤ n; 0 ≤ n < \length(ll) ⇒
22     \valid{L1}(\nth(ll,n)) ∧ \valid{L2}(\nth(ll,n)) ∧
23     \at((\nth(ll,n))->next, L1) == \at((\nth(ll,n))->next, L2) ∧
24     \at((\nth(ll,n))->data, L1) == \at((\nth(ll,n))->data, L2);
25 axiomatic cell_to_ll {
26   logic \list<cell*> to_ll{L}(cell* beg, cell* end)
27     reads {node->next | cell* node;
28            \valid(node) ∧ in_list(node, to_ll(beg, end))};
29   axiom to_ll_nil{L}: ∀ cell *node;
30     to_ll{L}(node, node) == \Nil;
31   axiom to_ll_cons{L}: ∀ cell *beg, *end;
32     (\separated(beg, end) ∧ \valid{L}(beg) ∧
33      ptr_sep_from_list{L}(beg, to_ll{L}(beg->next, end))) ⇒
34     to_ll{L}(beg, end) ==
35       \Cons(beg, to_ll{L}(beg->next, end));
36 }
37 */
38 #include "lemmas_min.h"


Fig. 1. Types and ACSL predicates for linked lists.


lists in functional programming. In the inductive case (linked_ll_cons),
overlapping list cells (or cyclic lists) are avoided by requiring that the first
cell bl is separated from all the other cells in the list including el, so the
list is well-formed. The predicates on Lines 4-9 use the predefined functions
\length and \nth, the latter returning the n-th element of a logic list.
Predicates can take one or several program points (C labels plus some ACSL
labels: Pre and Post). The built-in \at(e, L) specifies the value of an
expression e at a label L. Using these features, unchanged_ll states that a
logic list does not change between two program points (Lines 20-24). Finally,
Lines 25-36 define an axiomatic function to_ll that constructs a logic list
from a C linked list. While it would be possible to write
requires ∃ \list<cell*> ll; linked_ll(*pl, NULL, ll); instead of Line 72
of Figure 4, the scope of the existential quantifier is just this line.
Therefore, ll cannot be used in the post-conditions, hence the need for to_ll.

Let us now detail the contract of list_push (its code is detailed below).
The pre-conditions state that pl is a valid pointer to a list (Line 70), separated
from every element in the list (Line 71), and refers to a linked list verifying the
a ll_cell<0,0> :=
b   | [0]
c     - emp
d     - this = 0
e   | [2 addr int]
f     - this->0 |-> $0 * $0.ll_cell() *
g       this->4 |-> $1
h     - alloc(this, 8) & this ≠ 0.
i
j plist<0,0> :=
k   | [1 addr]
l     - this->0 |-> $0 * $0.ll_cell()
m     - alloc(this, 4) & this ≠ 0.
n cell<0,0> :=
o   | [0]
p     - emp
q     - this = 0
r   | [2 addr int]
s     - this->0 |-> $0 * this->4 |-> $1
t     - alloc(this, 8) & this ≠ 0.
u
v cell_plist<0,0> :=
w   | [2 addr addr]
x     - this->0 |-> $0 * $0.cell() *
y       this->4 |-> $1 * $1.plist()
z     - alloc(this, 8) & this ≠ 0.


Fig. 2. Inductive predicates for MEMCAD.


inductive predicate linked_ll (Line 72). Line 73 specifies that the only locations
the function is allowed to modify are *pl, the head pointer of the list, and 
\at(**pl, Post), the first element of the list at the exit point, i.e. the freshly 
allocated cell. We cannot reference the new list cell at the entry point because it 
is not allocated yet. In post-conditions, the returned value indicates whether or 
not the allocation is successful (Line 76). Regardless of the success, we expect the 
list invariants to hold (Lines 74-75). In case the allocation fails, we expect the 
pointer *pl and the list contents to be unchanged (Lines 77-79). If it succeeds, 
we expect the list to be composed of the new cell followed by the old list (Lines 
80-81), the old list being unchanged (Lines 82-83), and the fields of the new 
cell, next and data, resp., to point to the old list (Line 84) and to contain the 
expected value (Line 85). 
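
The pointer-level post-conditions of this contract can also be read as run-time checks. The sketch below is an illustration only: it re-implements list_push in plain C without the MEMCAD and ACSL annotations, and the wrapper checked_list_push is a hypothetical name. It covers only the post-conditions that are directly observable at run time (Lines 76, 79, 84 and 85); the properties stated over logic lists are what deductive verification is needed for.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct cell_s { struct cell_s *next; int data; } cell;
typedef cell *list;

/* Allocates a zero-initialized cell, or returns NULL on failure. */
static cell *calloc_cell(void) {
    cell *c = malloc(sizeof(cell));
    if (c) { c->next = NULL; c->data = 0; }
    return c;
}

/* Pushes data on top of *pl; returns 1 on success, 0 on failure. */
int list_push(list *pl, int data) {
    cell *c = calloc_cell();
    if (!c) return 0;
    c->next = *pl;
    c->data = data;
    *pl = c;
    return 1;
}

/* Calls list_push and checks selected post-conditions at run time. */
int checked_list_push(list *pl, int data) {
    list old = *pl;                       /* corresponds to \old(*pl) */
    int r = list_push(pl, data);
    assert(r == 0 || r == 1);             /* \result \in {0, 1} */
    if (r == 0) {
        assert(*pl == old);               /* head pointer unchanged */
    } else {
        assert((*pl)->next == old);       /* new cell points to old list */
        assert((*pl)->data == data);      /* new cell stores the value */
    }
    return r;
}
```

Such checks only observe individual executions, whereas the contract is proved for all inputs satisfying the pre-conditions.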


2.2 Shape Analysis with MemCAD 


The purpose of MEMCAD [10] is to automatically infer precise invariants about 
programs manipulating complex data structures. It is based on shape analysis [3], 
a static code analysis technique that discovers and verifies properties of recursive, 
dynamically allocated data structures. It relies on separation logic and abstract 
interpretation. Unlike in WP, the analysis is global. 

To use MEMCAD on linked lists defined on Lines 1-2 of Fig. 1, the user 
first defines an inductive predicate expressing a structural invariant of a
well-formed linked list, such as predicate ll_cell on Lines a-h of Fig. 2. A list, i.e.
a pointer to a list cell, satisfies the predicate in two cases. Each case defines 
a memory separation formula and additional constraints. In the first case, the 
pointer is null (Line d) and no specific memory separation is required (Line c). 
This case has no additional arguments (cf. [0] on Line b). The second case has 
two (existentially quantified) arguments: an address and an integer (Line e), 
denoted, resp., by $0 and $1 in the rest of the case. The pointer is non null 
and refers to a valid memory block of 8 bytes (Line h), assuming a 32-bit sys- 
tem. Lines f-g define the values of the fields next and data (at offsets 0 and 4) 
as $0 and $1, and require separation between those fields and the rest of the 
list. The separation is expressed by the separating conjunction “*” [10]. Notice 
40 //@ assigns \nothing;
41 void mc_chk_plist(list* pl) {
42   _memcad("check_inductive(pl,plist)");
43 }
44
45 typedef struct {cell* c; list* pl;} cell_plist;
46
47 //@ assigns \nothing;
48 void mc_chk_sep_cell_plist(cell* c, list* pl) {
49   cell_plist tmp;
50   tmp.c = c; tmp.pl = pl;
51   cell_plist* ptmp = &tmp;
52   _memcad("check_inductive(ptmp,cell_plist)");
53 }


Fig. 3. Auxiliary MEMCAD checks for linked lists.


that “...*$0.ll_cell()*...” on Line f specifies separation recursively, for all
list cells reached by the predicate via the inductive case. The user can insert
the instruction _memcad("add_inductive(l,ll_cell)"); to assume that list l
respects predicate ll_cell, or _memcad("check_inductive(l,ll_cell)"); to check
the same property in MEMCAD.

Predicate cell on Lines n-t is very close to predicate ll_cell except that
it only defines one list cell without recursion. Predicate plist on Lines j-m
expresses that a double pointer to a list cell (i.e. of type list*) is valid, refers 
to a well-formed list and is separated from its cells. Predicate cell_plist is 
explained below. 


3 Combined Approach 


3.1 Shape Analysis Assisted Verification 


To prove complex memory-related annotations with WP on real-life code [11], the 
user typically has to manually annotate the code with many additional carefully 
chosen assertions establishing structural invariants and separation properties at 
several intermediate program points, and to add numerous lemmas to facilitate 
reasoning about them (whose proof must usually be done manually in Coq, an
interactive proof assistant). Our approach proposes to let MEMCAD deal with 
the structural invariants of recursive data structures and separation properties, 
and to admit them in WP at some key points. 

In order to use both tools simultaneously in this way, we first need to show the 
equivalence between MEMCAD and WP inductive predicates. For MEMCAD, 
predicate ll_cell (Lines a-h of Fig. 2) specifies that each element of the list
is a valid cell, is separated from every other cell of the list and the list is
null-terminated. This is equivalent to the linked_ll predicate for WP
(Lines 10-19 of Fig. 1) when we consider the whole list. Indeed, when el is NULL, this
predicate also means that every list cell is valid and separated from any other list 
cell, and the list is null-terminated. Explicit separation conditions in the ACSL 
predicate for WP are expressed by the separating conjunction in the MEMCAD 
 59 /*@
 60   assigns \nothing;
 61   ensures \result ≠ NULL ⇒ (\valid(\result) ∧
 62     \result->next == NULL ∧ \result->data == 0); */
 63 cell* calloc_cell() {
 64   cell* c = malloc(sizeof(cell));
 65   if (c) { c->next = NULL; c->data = 0; }
 66   return c;
 67 }
 68
 69 /*@
 70   requires \valid(pl);
 71   requires dptr_sep_from_list(pl, to_ll(*pl, NULL));
 72   requires linked_ll(*pl, NULL, to_ll(*pl, NULL));
 73   assigns *pl, \at(**pl, Post);
 74   ensures dptr_sep_from_list(pl, to_ll(*pl, NULL));
 75   ensures linked_ll(*pl, NULL, to_ll(*pl, NULL));
 76   ensures \result \in {0, 1};
 77   ensures \result == 0 ⇒
 78     unchanged_ll{Pre, Post}(to_ll(*pl, NULL));
 79   ensures \result == 0 ⇒ *pl == \old(*pl);
 80   ensures \result == 1 ⇒
 81     to_ll(*pl, NULL) == ([|*pl|] ^ to_ll(\old(*pl), NULL));
 82   ensures \result == 1 ⇒
 83     unchanged_ll{Pre, Post}(to_ll(\old(*pl), NULL));
 84   ensures \result == 1 ⇒ (*pl)->next == \old(*pl);
 85   ensures \result == 1 ⇒ (*pl)->data == data; */
 86 int list_push(list* pl, int data) {
 87   cell* c = calloc_cell();
 88   if (!c) return 0;
 89   mc_chk_sep_cell_plist(c, pl);
 90   //@ admit ptr_sep_from_list(c, to_ll(*pl, NULL));
 91   //@ admit \separated(pl, c);
 92   //@ ghost Alloc:;
 93   c->next = *pl;
 94   //@ assert unchanged_ll{Alloc, Here}(to_ll{Alloc}(*pl, NULL));
 95   c->data = data;
 96   //@ ghost Link:;
 97   *pl = c;
 98   /*@ assert unchanged_ll{Link, Here}(
 99         to_ll{Link}(\at(*pl, Pre), NULL)); */
100   mc_chk_plist(pl);
101   //@ admit dptr_sep_from_list(pl, to_ll(*pl, NULL));
102   //@ admit linked_ll(*pl, NULL, to_ll(*pl, NULL));
103   return 1;
104 }


Fig. 4. Functions calloc_cell and list_push with contracts.


counterpart. (Notice that separation of bl with NULL on Line 15 is trivial.) The
sequence of list elements, expressed by a logic list in ACSL and used to prove
functional properties about the contents of the list (cf. Lines 80-81) in WP, 
does not need to be specified for MEMCAD, which we only use to reason about 
structural properties. 


To check if invariants hold in MEMCAD, we define check functions shown 
in Fig. 3. These functions are specified to be side-effect-free (cf. Lines 40, 47) to 
prevent interference with the proof in WP. 
The first function, mc_chk_plist (Lines 41-43), checks that pl respects the
plist predicate, i.e. is a valid pointer to a well-formed list from which it is
separated (Line 42, see also Lines j-m of Fig. 2).

The goal of the second function, mc_chk_sep_cell_plist, is to check that 
c refers to a list cell, pl respects the plist predicate, and the corresponding 
pointer and the list cells are separated from the cell referred to by c. To do that 
in MEMCAD, we introduce an ad-hoc structure cell_plist with both pointers 
(Line 45). The function initializes a local structure (Lines 49-50) and takes its 
address (Line 51) in order to express the required check (Line 52). This check 
relies on the predicate cell_plist (Lines v-z of Fig. 2) stating that the given 
pointer is non-null and refers to a structure with two pointers at offsets 0 and 4, 
denoted $0 and $1, referring, resp., to a cell and to a double pointer to a well- 
formed list, which are separated (between them and from the list cells). Notice 
that “...*$1.plist()” on Line y specifies separation recursively, that is, from 
all locations considered in separation constraints reached via plist (and hence 
via ll_cell).

An important benefit of using MEMCAD is its capacity to automatically 
handle dynamic memory allocation, which is not yet supported in WP. Thus, 
we define a custom allocator that simulates the behavior of calloc for list cells 
on Lines 59-67 of Fig. 4. WP uses its contract, which is simple but currently 
unprovable by WP since dynamic allocation is not supported (it should become 
provable when this support is added into WP). 


3.2 Proof of Function list_push 


We illustrate our approach on function list_push of Fig. 4. It tries to allocate 
a new cell (Lines 87-88), and, in case of success, puts it on top of the list with 
the given data (Lines 93, 95, 97, 103). Lines 92, 96 define ghost labels (that is, 
labels used only in annotations). 

Lines 89-91 show how we use MEMCAD to verify that the new cell (referred 
to by c) is separated both from the list cells and the pointer referred to by pl 
(Line 89), and introduce these properties as assumptions for WP (admit clauses 
on Lines 90-91). They help WP to prove in an assert clause on Line 94 that 
the list remains unchanged since label Alloc (i.e. Line 92) despite writing into 
the new cell on Line 93, and a similar assertion for the old list on Lines 98-99 
despite the assignment on Line 97. 

Instead of reasoning about the modified list directly in WP (which often 
presents another difficulty for deductive verification), we let MEMCAD check 
the list invariants on Line 100 and admit them on Lines 101-102 for WP to 
prove the post-conditions. Thanks to those assumptions, WP successfully proves 
this function. Notice that the check instruction for MEMCAD and the admit 
instructions for WP are placed (for the moment, manually) at the same program 
location to ensure the soundness of the global verification. 
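
To make the walkthrough above easier to follow, here is a minimal C sketch of the kind of function being verified. It is not the code of Fig. 4: the cell layout, the helper name cell_alloc and the return convention are assumptions of this sketch, and the MEMCAD checks and WP admit clauses are only indicated by comments at the program points where the text places them.

```c
#include <stdlib.h>

/* Hypothetical list cell; the layout used in Fig. 4 may differ. */
struct cell {
    struct cell *next;
    int data;
};
typedef struct cell *list;

/* Assumed stand-in for the custom allocator: a zeroed cell, or NULL. */
static struct cell *cell_alloc(void) {
    return calloc(1, sizeof(struct cell));
}

/* Pushes data on top of *pl. Returns 0 on success and -1 if the
   allocation fails (the return convention is an assumption). */
int list_push(list *pl, int data) {
    struct cell *c = cell_alloc();
    if (c == NULL) {
        return -1;          /* the list is left unchanged */
    }
    /* Shape-analysis check: c is separated from pl and from the list
       cells; the deductive verifier admits these facts at this point. */
    c->data = data;         /* writes into the fresh cell only */
    c->next = *pl;          /* links the old list behind the new cell */
    *pl = c;                /* the new cell becomes the head */
    /* Shape-analysis check: *pl is again a well-formed list; the
       deductive verifier admits the list invariants at this point. */
    return 0;
}
```

Starting from an empty list (a NULL head), two successive calls leave the second value in the head cell and the first value in the cell after it.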

int mc_verify_list_push(void) { 
  list* pl; int i; _memcad("add_inductive(pl,plist)"); 
  list_push(pl, i); 
} 


Fig. 5. Wrapper to verify list_push in MEMCAD. 


In order to have a full proof, we also need to run MEMCAD to verify all the 
checks in list_push. For this purpose, we define a wrapper in Fig. 5 to analyze 
the call to list_push on Line 108 with an arbitrary list respecting the given 
preconditions (which correspond in MEMCAD, as we explained above, to assuming 
predicate plist for pl, cf. Lines 70-72, 107). MEMCAD also succeeds in its 
analysis; hence, we can conclude that our function respects its ACSL contract. 
While the annotation step is done manually in the current work, it can be 
better automated in the future. A coordinated generation of checks and as- 
sumptions for a given recursive data structure for both tools will facilitate the 
verification and the justification of soundness of the combined approach. An 
early idea consists in defining a domain-specific language for the description of 
the target recursive data structure that is then used for the generation of neces- 
sary predicates for MEMCAD and for WP as well as necessary assumptions and 
checks. The investigation of this research direction is left for future work. 


4 Case study on the tpm2-tss library 


We tested our approach on a few (slightly simplified) functions of the tpm2-tss 
library, a widely used open-source implementation of the TPM Software Stack 
(TSS)⁵ designed to access the Trusted Platform Module (TPM). The library 
uses a linked list to store and use TPM resources, such as objects sent to and 
received from the TPM. List cells are dynamically allocated. Simplifications were 
applied to data structures used for list cells (and their treatment). 

We consider two functions, to add an object and to look for an object in a list, 
with one called function, and apply MEMCAD to verify separation properties 
for a newly allocated cell that WP is currently not able to deduce. A recent 
study [11] demonstrated that deductive verification with WP of these functions 
required many additional lemmas and assertions, as well as the replacement of 
the dynamic memory allocation by a static allocator. Interestingly, the difficulty 
of verifying real-life code was not caused by complex operations on lists (these 
operations are in reality quite simple in the target code) but by the difficulty 
of reasoning about the recursive data structure itself. 

5 https://trustedcomputinggroup.org/work-groups/software-stack/ 


The proposed approach combining deductive verification with shape analysis 
allows us to perform a complete proof with less effort and without replacing the 
dynamic allocator by a static allocator. On the considered functions, the proof with 
WP alone [11] required 14 lemmas, leading to the generation of 241 proof 
obligations, one of which required a manually created WP script, and took 4 min 50 s. 
Thanks to combining WP and MEMCAD in our work, we could remove ~45 
auxiliary ACSL annotations and 5 lemmas, so the proof required only 9 lemmas, 
leading to 194 proof obligations using no scripts, and took 1 min 47 s in total for 
WP and MEMCAD (the latter taking less than 1 s). 
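
Read as relative savings, the figures above amount to roughly 36% fewer lemmas (14 to 9), about 19.5% fewer proof obligations (241 to 194), and about 63% less time (290 s to 107 s). The small helper below reproduces this arithmetic; the function name and the choice of tenths of a percent as unit are assumptions of this sketch, not part of the study.

```c
/* Relative reduction from `before` to `after`, in tenths of a percent,
   rounded to the nearest integer. Assumes 0 < before and after <= before. */
unsigned reduction_permille(unsigned before, unsigned after) {
    return ((before - after) * 1000u + before / 2u) / before;
}
```

For example, reduction_permille(290, 107) compares the two run times expressed in seconds.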


5 Related Work and Conclusion 


Related Work. Various tools based on separation logic were proposed, such as 
VeriFast [8], Viper [7], VerCors [2]. He et al. [5] extract functional specification 
from imperative programs using a memory-safe type system and insert dynamic 
checks into the specification. GRASShopper [9] combines separation logic with 
an SMT-based verifier. Unlike in our work, GRASShopper does not integrate 
abstract interpretation based shape analysis (which allows us to infer structural 
invariants with MEMCAD without having to provide loop invariants for this 
tool). Issues reported in a recent study [11] motivate such combinations for com- 
plex real-life code with recursive data structures. Our work continues previous 
efforts by proposing a combination of weakest-precondition based deductive ver- 
ification with abstract interpretation based shape analysis on the source-code 
level, which, to the best of our knowledge, was not studied and evaluated before. 


Conclusion and Future Work. This short paper has presented an approach com- 
bining deductive verification with FRAMA-C/WP and shape analysis with MEM- 
CAD. Separation properties and structural invariants for linked data structures 
can be more easily proved by the latter, and then used as assumptions in the for- 
mer, thus allowing it to focus on other properties. This work is still ongoing and 
opens interesting research questions and perspectives: automation of the pro- 
posed verification technique including a coordinated generation of checks and 
assumptions, proof of its soundness, design of a common (higher-level) specifi- 
cation mechanism for recursive data structures with automatic translation into 
suitable definitions for MEMCAD and FRAMA-C, as well as evaluation on other 
relevant case studies. 


Data-Availability Statement. Code examples used in this paper are available on- 
line as a companion artifact on http://doi.org/10.5281/zenodo.10458675. The 
artifact includes a Virtual Machine containing the installed tools and code ex- 
amples used, and can be used to reproduce the results of this paper. 
Abstract. Over the last years, deductive program verifiers have 
substantially improved, and their applicability to non-trivial applications 
has been demonstrated. However, a major bottleneck is that for every 
new programming language, a new deductive verifier has to be built. 
This paper describes the first steps in a project that aims to address 
this problem, by language-agnostic support for deductive verification: 
Rather than building a deductive program verifier for every program- 
ming language, we develop deductive program verification technology 
for a widely-used intermediate representation language (LLVM IR), such 
that we eventually get verification support for any language that can be 
compiled into the LLVM IR format. 

Concretely, this paper describes the design of VCLLVM, a prototype tool 
that adds LLVM IR as a supported language to the VerCors verifier. We 
discuss the challenges that have to be addressed to develop verification 
support for such a low-level language. Moreover, we also sketch how we 
envisage to build verification support for any specified source program 
that can be compiled into LLVM IR on top of VCLLVM. 


1 Introduction 


As software has become an intrinsic part of our daily lives, we become more and 
more dependent on software being reliable and dependable, and we need tools 
that can help us to establish these guarantees. Over the last years, substantial 
progress has been made in the development of formal verification techniques that 
can be used to ensure that software provides certain guarantees. This covers a 
wide range of different approaches that can be used to provide guarantees at 
different levels of abstraction and precision. Here, we focus in particular on 
deductive program verification techniques [11], which are used to provide guarantees 
directly at code level, by verifying whether a program fragment behaves according 
to the pre-postcondition contract that is specified for it. A broad range of 
deductive verifiers exist, such as VerCors [4], KeY [1], VeriFast [15], Viper [25], 
Dafny [20], RESOLVE [37], Whiley [31], Frama-C [3], KIV [9] and OpenJML [7], 
which have been used in several non-trivial case studies, see e.g. [17]. A major 
challenge for deductive verifiers in practice is to enlarge the set of language 
features that they support. This language dependency creates a severe limitation 
on how effectively these techniques can be used in current software development, 
where language standards are regularly updated, new programming languages are 
frequently used, and applications are often written using multiple programming 
languages. 

In compiler technology, this growth in source level programming languages, 
as well as the wide range of target architectures has been tackled by the introduc- 
tion of intermediate representation formats, such as LLVM IR [19]. They require 
only a compiler into this intermediate representation format for a new program- 
ming language, while new architectures are supported by defining a mapping 
from the intermediate representation format into the new hardware. We propose 
a similar approach to reduce the language-dependence of deductive program ver- 
ification technology, by: (1) defining verification technology for LLVM IR, and 
(2) developing a generic approach to translate contract specifications from a wide 
range of source languages into contract specifications for LLVM IR. 

This paper focuses in particular on the first step in this project: it contributes 
VCLLVM, a prototype tool that encodes annotated LLVM IR programs into the 
VerCors verifier to enable deductive verification for LLVM IR. We describe 
the challenges for the encoding of LLVM IR into VerCors, as LLVM IR is a much 
lower-level language than the languages that are supported by VerCors already, 
and how these challenges affect the design and implementation of VCLLVM. We 
also sketch how we plan to use VCLLVM as a stepping stone in a bigger project 
to develop language-independent support for deductive verification. 


2 Background 
This section gives a brief background on the VerCors verifier and LLVM IR. 


VerCors. VerCors [4] is a deductive verifier for concurrent programs. It can verify 
programs written in several programming languages (e.g., Java, CUDA, OpenCL, 
and its internal Prototype Verification Language PVL). To verify programs with 
VerCors, they are first annotated with pre-postcondition contract specifications 
written in permission-based separation logic (PBSL) [88], and then the specified 
programs are encoded into the internal format of VerCors, called COL, which is 
transformed in several steps into the input language of Viper [25]. The Viper 
infrastructure is then used for verification. If verification with Viper fails, VerCors 
translates the error message back to the level of the source program. 

PBSL is a concurrent separation logic with support for permissions [5]. 
Permissions make the language suitable to reason about concurrent programs, 
as they are used to encode when variables may be read or written. VCLLVM at 
the moment only supports sequential programs, thus we do not provide further 
details about PBSL here, and instead refer to the documentation. 


LLVM IR. LLVM IR (LLVM Intermediate Representation) is the common 
interface for the frontend and backend compilers developed as part of the LLVM 
project [19]. LLVM IR is designed to be abstract enough to serve as a compilation 
target for higher-level frontend languages, and simple enough to be transformed into 
assembly or machine code for a specific CPU architecture. It is also the language 
operated on by middle-end code optimisation and analysis passes [23]. 
More details about LLVM IR can be found in its documentation [22]. 

The LLVM IR language is an assembly language in static single assignment 
(SSA) form. Each LLVM IR file consists of one module. Each module contains 
multiple functions. Functions are divided into multiple (possibly labelled) blocks, 
with one dedicated entry block. Every block consists of one or more instructions. 
We briefly summarise the main features of LLVM IR that are relevant for our 
work. First of all, LLVM IR features only two basic types, namely integers and 
floats, with the standard (bitwise) binary operators. Both come with different 
precisions. These two basic types can be combined into aggregate types, such as 
vectors, arrays, and structs, and can be referenced via pointers. Further, LLVM 
IR supports custom-declared constants and several predefined constants, such 
as true and false. The constant undef is used to represent undefined state to 
the compiler as a range of possible values, which guarantees that the program 
itself remains well-defined. The constant poison indicates erroneous state of a 
program. LLVM IR offers branch instructions that can conditionally jump to 
the beginning of any instruction block in the same function. This can be used 
to encode conditionals and loops, and it offers a basis for error handling 
instructions. It is important to note that the internals of LLVM IR are not stable, 
meaning there are no guarantees for compatibility between different LLVM IR 
versions [21]. However, there are stable LLVM API functions that can analyse 
and manipulate the internals of LLVM IR. 


3 Challenges for Deductive Verification of LLVM IR 


In order to encode LLVM IR programs into input for the VerCors verifier, several 
challenges need to be addressed, as discussed in this section. The next section 
discusses how these challenges influence our prototype design and 
implementation. In particular, Challenges 1 to 3 have been addressed in our prototype, while 
providing full solutions to Challenges 4 to 7 has been left as future work. 


— Challenge 1: Instability of LLVM IR As mentioned, LLVM IR is an unstable 
language [21], without backwards compatibility, and there is no guaranteed 
interoperability between the syntax of LLVM IR of different LLVM versions. 

— Challenge 2: LLVM IR Specifications VerCors specifications use expressions 
from the source language. As expressions in LLVM IR are written as a block 
of single instructions, this raises the question what a suitable specification 
language for LLVM IR would be: writing blocks in specifications (or even 
multiple blocks with branches, e.g. for Boolean expressions) would be im- 
practical and error-prone. However, an upside is that LLVM IR uses the SSA 
(static single-assignment) format, which makes it hard to write specifications 
that have side effects, and all instructions in LLVM IR are pure except for 
memory instructions such as store and alloca. 
— Challenge 3: Origin of User Errors Parsing an LLVM IR file with the parser 
of the LLVM API returns a module object that does not retain any origin 
information; it is merely a semantically equivalent in-memory representation 
of the program. This makes it challenging to communicate the origin of a 
verification problem in the source code to the user. LLVM offers the possi- 
bility to construct a string of LLVM IR representing any LLVM value, but 
calculating line and column numbers or extracting a source string is complex 
as extraneous white spaces and comments in the source file are ignored. 


— Challenge 4: Control Flow LLVM IR depends on jumps and branches (i.e. 
goto statements) in the function body to facilitate any control flow in a pro- 
gram, while VerCors requires structured, reducible programs to be verified. 
VerCors technically supports goto statements but there are some caveats to 
be aware of when using them: the inclusion of goto statements obstructs the 
guarantee that the program is reducible [12], and loop invariants are hard 
to verify when a loop contains arbitrary goto statements. 


With that in mind, the encoding essentially needs to be an LLVM IR de- 
compiler to the high-level COL representation of VerCors. Loops can be 
especially hard to recover due to their various forms (e.g. for-loops, while- 
loops, and do-while loops), and the possibility of nesting. The challenge is 
not so much in detecting cycles in the CFG (control flow graph) of the pro- 
gram (for which trivial graph algorithms exist), but mainly to identify the 
different parts of the loop (e.g., the loop condition, the loop body, and loop 
breaks). 


— Challenge 5: Low-level Language Features LLVM IR introduces new low- 
level language constructs that have not been handled by VerCors yet, such 
as loads, stores and other low-level memory instructions, φ nodes (from the 
SSA format), and low-level exception handling. All these concepts have to 
be integrated into COL. 


The current VCLLVM prototype simplifies many of these concepts or has 
not yet implemented them. Some ideas on how other LLVM IR low-level 
concepts could be translated into COL are discussed in [28]. 


— Challenge 6: LLVM Concurrency Model While LLVM IR does support in- 
structions and control mechanisms that can be useful to ensure thread safety, 
it does not support constructs for parallel thread creation or signal handling 
natively. Instead, LLVM IR code depends on being linked against existing 
concurrency libraries, e.g. the pthread library on POSIX systems for Clang. 
Thus, in order to support reasoning about these concurrency libraries, their 
behaviour has to be modeled. 


— Challenge 7: undef and poison Both constants undef and poison are se- 
mantically complex, and it is challenging to capture their semantics into 
VerCors. First, undef represents a set of possible values, which should be 
semantically treated as if it is a single value, and this concept does not yet 
exist in VerCors. Second, poison indicates erroneous behaviour, and it will 
have to be integrated into exception handling support of VerCors. 
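
Challenge 4 notes that detecting cycles in the CFG is the easy part of control-flow reconstruction. As an illustration of that claim only, the following sketch counts back edges with a depth-first search; the cfg structure, the bound of 16 blocks and the limit of two successors per block are assumptions of this example and do not reflect the LLVM API or the VCLLVM implementation. Identifying the loop condition, body and breaks, which the text names as the harder part, is not attempted here.

```c
#include <stddef.h>

#define MAX_BLOCKS 16

/* Hypothetical CFG: succ[b][i] is the i-th successor of block b. */
struct cfg {
    size_t num_blocks;
    size_t num_succ[MAX_BLOCKS];
    size_t succ[MAX_BLOCKS][2];   /* at most two branch targets here */
};

enum color { WHITE, GREY, BLACK };

static size_t visit(const struct cfg *g, size_t b, enum color *col) {
    size_t back_edges = 0;
    col[b] = GREY;                        /* b is on the current DFS path */
    for (size_t i = 0; i < g->num_succ[b]; i++) {
        size_t s = g->succ[b][i];
        if (col[s] == GREY) {
            back_edges++;                 /* edge to an ancestor: a cycle */
        } else if (col[s] == WHITE) {
            back_edges += visit(g, s, col);
        }
    }
    col[b] = BLACK;                       /* b is fully explored */
    return back_edges;
}

/* Counts the back edges reachable from the entry block (block 0). */
size_t count_back_edges(const struct cfg *g) {
    enum color col[MAX_BLOCKS] = { WHITE };
    return g->num_blocks == 0 ? 0 : visit(g, 0, col);
}
```

A result of zero means that the reachable part of the graph is acyclic; each back edge found marks a jump to a block that is still on the current search path.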
Fig. 1: Workflow of using VCLLVM and VerCors 


4 Design and Implementation of VCLLVM 


This section discusses the design and implementation of our prototype tool 
VCLLVM that translates LLVM IR programs into the VerCors internal COL 
format. Figure 1 gives a general overview of how VCLLVM connects to VerCors. 
We discuss the main decisions in the design of VCLLVM, taking into account the 
challenges mentioned above. For a more in-depth analysis of the design choices, 
we refer to the Master thesis accompanying this paper [28]. 


Embedding versus Externalising. The first design choice was whether to embed 
VCLLVM into the VerCors codebase or to develop it as an extension. Embedding 
could exacerbate the problems of Challenge 1 (instability of LLVM IR), and it 
would also restrict the tool implementation language to be JVM-compatible, 
which makes it hard to interface with existing LLVM IR functionality from the 
LLVM project. Instead, externalising makes it possible to use C++ to implement 
VCLLVM and to use all existing LLVM support functionality. We decided to go 
for this option, as it makes VCLLVM easier to maintain in the future. 


VCLLVM Output Format. As VCLLVM is developed as an external tool, its 
output needs to be in a format that is either already interpretable by VerCors 
or for which an interpreter would be simple to implement. If VCLLVM generated 
concrete syntax, we would have to define a concrete input language 
that supports all features of LLVM IR. Instead, we opted to use serialisation, 
which makes it possible to connect to the internal COL AST directly. We use 
Protocol Buffers¹ for this. This offers a largely automatable serialisation method, 
with language support for Scala (implementation language of VerCors) and C++ 
(implementation language of VCLLVM). Moreover, it supports code generation 
both from and to a Protocol Buffer definition, which simplifies the development 
of the communication layer between VCLLVM and VerCors considerably. 


Specification Syntax. To specify the properties that need to be verified, we need 
to embed the specifications into LLVM IR code such that they do not change 
the behaviour of the program, but are available to VCLLVM after the LLVM IR 
program has been parsed. Since comments are ignored by the LLVM parser, the 
only option available is to use LLVM metadata to embed specifications. 


1 See: https://developers.google.com/protocol-buffers 
!VC.contract !{ 
    !"ensures", 
    !"%var1 = mul i32 %y, %x", 
    !"%var2 = add i32 %var1, %z", 
    !"%verdict = icmp eq i32 %var2, \result;" 
} 
(a) 

!VC.contract !{ 
    !"ensures %x * %y + %z == \result;" 
} 
(b) 

!VC.contract !{ 
    !"ensures icmp(eq, 
        add(mul(%y, %x), %z), \result);" 
} 
(c) 


Fig. 2: Possible Specification Syntax Options 


Ideally, the specification syntax stays as close as possible to the LLVM IR 
syntax, but as explained in Challenge 2, this is not straightforward for LLVM IR 
because of its low-level nature. We considered three different options, as 
illustrated in Figure 2, with contracts that describe the following add-multiply 
LLVM IR function. 


define i32 @addMult(i32 %x, i32 %y, i32 %z) 
    !VC.contract !1 {  ; or !2 or !3 from Figure 2 
  %1 = mul i32 %y, %x 
  %res2 = add i32 %1, %z 
  ret i32 %res2 
} 


This function takes as input parameters x, y and z. First it multiplies x and y, 
stores the intermediate result in a local variable %1, then adds z to this, and 
returns the final result. All specifications in Figure 2 express that the return 
value is equal to x * y + z. As usual, we use the keyword ensures to specify 
a postcondition of the function, and \result to refer to the output value of 
the function. Figure 2a uses blocks of instructions to write the specification 
expressions. This is verbose, error-prone and complicates parsing. Figure 2b 
uses a specification syntax that is independent of LLVM IR syntax. This is 
readable, but also creates ambiguities, as it makes it harder to connect the 
specification to the code. Finally, Figure 2c uses the known LLVM IR instruction 
keywords, but in a more functional manner. This is fairly readable, and avoids 
the ambiguity. We decided to use this option for VCLLVM. Notice that, as 
described in Section 7, eventually we hope to use VCLLVM as an intermediate 
tool to reason about programs in any language that compiles into LLVM IR. 
In that setup, the specification would be written at the level of the 
high-level source language, and compiled into a VCLLVM specification. 
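
For readers more familiar with C than with LLVM IR, the function and its postcondition can be mirrored as follows. This is an illustrative sketch rather than the source from which the IR above was produced: the names add_mult and add_mult_post are hypothetical, and uint32_t is chosen because mul and add on i32 without overflow flags wrap modulo 2^32, whereas signed overflow is undefined in C.

```c
#include <stdint.h>

/* C counterpart of @addMult, one statement per IR instruction. */
uint32_t add_mult(uint32_t x, uint32_t y, uint32_t z) {
    uint32_t t = y * x;       /* %1    = mul i32 %y, %x */
    uint32_t res = t + z;     /* %res2 = add i32 %1, %z */
    return res;               /* ret i32 %res2 */
}

/* The postcondition icmp(eq, add(mul(%y, %x), %z), \result), written as
   a C predicate over the inputs and the returned value. */
int add_mult_post(uint32_t x, uint32_t y, uint32_t z, uint32_t result) {
    return (uint32_t)(y * x) + z == result;
}
```

Evaluating add_mult_post on the value returned by add_mult only tests the contract for one input, whereas a deductive verifier establishes it for all inputs.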
External library support. LLVM IR code is often compiled and linked against 
existing external libraries. Support for this is needed in particular to reason 
about concurrent LLVM IR programs, which rely on thread libraries. 
The VCLLVM prototype has been designed with this requirement in mind, but 
it has not yet been implemented. 


5 Evaluation 


To use the current version of VCLLVM, one needs to (1) write C code, (2) compile 
that C code to LLVM IR, (3) optionally run the LLVM opt tool to mitigate 
program structures VCLLVM cannot yet interpret, (4) annotate the resulting 
LLVM IR program manually, and (5) let VCLLVM/VerCors verify the LLVM 
IR program. C is recommended because the C LLVM compiler (Clang) produces 
concise LLVM IR code (unlike some of the other frontends like clang++ and 
rustc). Moreover, the regression test suite of VCLLVM currently only supports 

The tool is only a prototype, but it has been used on several non-trivial 
examples, such as functions to compute triangular numbers and Cantor pairs, 
a function for date comparison (using branching and integer comparison), and 
recursive functions like Fibonacci and the factorial. In order to specify func- 
tional behaviour of these programs, VCLLVM supports the definition of pure 
specification-only functions, such as for example fib: 


!VC.global = !{!0} 
!0 = !{ 
    !"pure i32 @fib(i32 %n) = 
        br(icmp(sgt, %n, 2), 
           add(call @fib(sub(%n, 1)), call @fib(sub(%n, 2))), 1);"} 


This expresses that, for any n, fib(n) is computed using the following 
expression: if (n > 2) then fib(n - 1) + fib(n - 2) else 1 (where br denotes 
a branch and icmp compares two integers). 

Using this function, we can write and prove the following contract for a 
recursive implementation of the Fibonacci function, see for the full program. 
This contract states that for any n ≥ 1, the correct Fibonacci value is returned. 


define dso_local i32 @fibonacci(i32 noundef %0) 
    !VC.contract !{ 
        !"requires icmp(sge, %0, 1);", 
        !"ensures icmp(eq, \result, call @fib(%0));" 
    } 
{... } 
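
The relation between the pure specification function and the contract can be mirrored in C. In the sketch below, fib_spec transcribes the definition of @fib, while fibonacci is a hypothetical iterative implementation chosen for illustration (the program discussed in the text is recursive). The check compares the two at run time for a single argument, which is weaker than the proof obtained with VCLLVM and VerCors.

```c
#include <assert.h>

/* Transcription of the pure specification function @fib:
   br(icmp(sgt, %n, 2), fib(n - 1) + fib(n - 2), 1). */
int fib_spec(int n) {
    return n > 2 ? fib_spec(n - 1) + fib_spec(n - 2) : 1;
}

/* Hypothetical implementation under test (iterative). */
int fibonacci(int n) {
    assert(n >= 1);               /* requires icmp(sge, %0, 1) */
    int prev = 1, cur = 1;        /* fib(1) and fib(2) */
    for (int i = 3; i <= n; i++) {
        int next = prev + cur;
        prev = cur;
        cur = next;
    }
    return cur;
}

/* Run-time counterpart of: ensures icmp(eq, \result, call @fib(%0)). */
int fibonacci_meets_contract(int n) {
    return fibonacci(n) == fib_spec(n);
}
```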


Special attention has been paid to giving informative feedback when verification 
fails. For more details about these examples, we refer to [28]. 
6 Related Work 


There exist several projects that develop formal static analysis techniques for bug 
finding in LLVM IR. SMACK defines a translation of LLVM IR into 
BoogiePL [20], to reason about C programs using assertions that are compiled into 
LLVM IR using Clang. The verification itself is bounded and a potential 
extension to contract specifications has not yet been explored. The Vellvm project 
[39] develops a framework to reason about LLVM IR programs. It provides a 
mechanised semantics for LLVM IR, which can be used for verification. 
Reasoning is done directly in Coq, rather than at the code level, which requires 
Coq expertise. KLEE [6] is a dynamic symbolic execution engine, which 
automatically generates suitable unit tests for LLVM IR applications, with a much 
better coverage than manually created test suites, thus increasing the likelihood 
of finding bugs. However, KLEE focuses only on bug finding, not on proving 
correctness. Another recent tool to easily find bugs via a bounded analysis of 
LLVM IR programs is Alive2 [24], which is tailored to reduce the number of false 
positives. Other model checkers or bounded verifiers for LLVM IR are LLMC p], 
RCMC [16], Serval [26], FauST and SAW [8]. They can only check properties 
over a bounded state space, in contrast to our approach which uses deductive 
verification. PhASAR is a static analysis framework for LLVM IR [36]. Users 
specify arbitrary data-flow properties, and PhASAR then fully automatically 
tries to analyse these properties. The approach shows promising results, but as 
it is fully automatic, it also suffers from imprecisions that have to be manually 
filtered out. Lammich formalises the semantics of LLVM IR, using it as the 
target language of the Refinement Framework in Isabelle. They do not analyse 
LLVM IR programs, but rather they derive correct by construction LLVM IR 
programs. Finally, verifying complex programs in the current VCLLVM/VerCors 
implementation heavily relies on pure functions. This is similar to the approach of 
Paganoni and Furia using predicates to verify Java bytecode. 


7 Next Steps 


As mentioned above, the current version of VCLLVM is still a prototype, and 
it needs to be extended with better support for more language features, control 
flow reconstruction, concurrency, and library inclusion. 

Ultimately, the idea is not to use VCLLVM as a standalone tool to verify 
LLVM IR programs directly, but rather to use it as part of a larger 
infrastructure (called Pallas) that will provide deductive verification support for any 
programming language that can be compiled into LLVM IR. Figure 3 gives a 
visual representation of the Pallas infrastructure. It will define a generic 
specification format for contract specifications. For each source-level programming 
language supported by Pallas, a concrete contract specification syntax is defined 
to specify the desired program properties at the level of the source language, 
and then this should be embedded into the generic contract specification 
format. The source to LLVM IR compiler is then used, combined with a compiler 
for the contract specifications in the generic contract specification format to the 
LLVM IR format. VCLLVM then enables VerCors to reason about the program. 
If verification succeeds, we know that the original source program satisfies the 
source-code-level contracts; if verification fails, the error message will be 
translated back into an error message for the source program. 


Fig. 3: Pallas Overall Idea 

Further research questions that we need to investigate to create the Pallas 
infrastructure are: (1) How to define a generic contract specification format that 
can capture program properties for a large class of source-level programming 
languages? (2) How to define a generic translation from the contract specification 
format into LLVM IR contract specifications, which can be parametrised by the 
compiler from a specific source language into LLVM IR? (3) How to provide 
effective feedback at the level of the source language if verification at the LLVM 
IR level fails by using decompilation techniques? 


8 Conclusions 


As a first step to solve the language-dependency problem of deductive verifiers, 
we propose to use the LLVM IR format as a generic format. This paper sketches 
the design of VCLLVM, a prototype implementation that enables deductive ver- 
ification of LLVM IR programs, and we discuss the kind of examples that can 
already be verified. In future work, we will expand this into a deductive verifi- 
cation framework for any language that can be compiled into LLVM IR. 


Data-Availability Statement 


The artifact accompanying this paper can be found in [41]. 
Abstract. FDSE is an automatic test generation tool for C programs 
based on symbolic execution. FDSE employs fuzzing-based pre-analysis 
and combines static symbolic execution and dynamic symbolic execution 
to improve the effectiveness of test generation. FDSE achieved a score 
of 5132 and ranked 4th in the branch coverage track of Test-Comp 2024. 


Keywords: Symbolic Execution · Fuzzing · Test-Case Generation. 


1 Test Generation Approach 


Test case design is one of the most labor-intensive tasks in software engineering. 
Automatic test case generation helps the test case designers reduce labor and 
improve testing quality. Existing techniques usually accept more than one type 
of software artifact (e.g., source code and software models) as input. Then, these 
techniques utilize existing methods (e.g., optimization [10] or program analysis 
[10]) to generate test cases. Besides, some approaches combine different methods 
to achieve better effectiveness and efficiency [1]. 

Symbolic execution (SE) [5] is one of the underlying techniques that can be 
used for automatic test case generation. Current SE methods can be categorized 
into static symbolic execution (SSE) and dynamic symbolic execution (DSE). 
SSE simulates the execution of the program using symbolic inputs. During anal- 
ysis, SSE maintains many execution states. When encountering a branch state- 
ment, SSE forks states to explore both branches. Many SSE engines have been 
developed, such as KLEE [4] and SPF [9], to name a few. DSE combines symbolic 
execution and concrete execution to further improve SE’s effectiveness and 
efficiency. Specifically, DSE executes the program on a concrete input and 
collects the path constraint of the current execution. Then, based on this path 
constraint, DSE constructs a new constraint for generating a new input that steers the program 


*Z. Chen—Jury Member. 


© The Author(s) 2024 
D. Beyer and A. Cavalcanti (Eds.): FASE 2024, LNCS 14573, pp. 304-308, 2024. 
https://doi.org/10.1007/978-3-031-57259-3_16 


FDSE: Enhance Symbolic Execution by Fuzzing-based Pre-Analysis 305 


Fig. 1: FDSE’s Workflow in Test-Comp. (Diagram components: C Code, Bytecode, 
Instrumented Code for Fuzzing, Fuzzer, Pre-Analysis, Input Seed and DSE Config, 
SSE Engine, DSE Engine, Test Property, Test-Cases.) 


to a different program path. In principle, SSE and DSE provide different means of 
systematically exploring the program’s path space. 

FDSE is mainly an SE-based test case generator. In most cases, FDSE uses DSE 
to generate tests. DSE struggles with programs that run for a long time or use 
large symbolic data, e.g., programs with large symbolic arrays, loops, or many 
branches. To mitigate this, FDSE employs a fuzzing-based pre-analysis and 
combines DSE with SSE to generate tests for the Test-Comp benchmarks more 
effectively and efficiently. 


2 Framework 


Figure 1 illustrates the Test-Comp version of FDSE. First, we compile the C 
program into bytecode and instrument the bytecode to generate a fuzzer for 
pre-analysis. During fuzzing, we record runtime features of the program, such 
as the number of input variables or branches and the size of allocated arrays. 
Second, we select either DSE or SSE according to the number of static 
branches, which is computed by a simple static analysis. If the number exceeds 
a threshold (10,000 in the competition), FDSE employs SSE because DSE 
may suffer from long execution times. Otherwise, FDSE continues to 
use DSE. Hence, exactly one of DSE and SSE is applied to each benchmark program. 
Finally, when the DSE engine is used, variables are symbolized selectively 
based on the information collected by fuzzing, which mitigates 
the problem of large symbolic arrays. Furthermore, the DSE engine bounds 
the number of loop unrollings to prevent path explosion. This fuzzing-based 
pre-analysis rests on the following two observations about the Test-Comp 
benchmarks. 


— When the program utilizes large loops to initialize a large-sized symbolic 
array (footnote 4), DSE maintains a huge number of symbolic variables internally, which 
hinders the analysis’s efficiency and frequently exceeds memory limits. To 
mitigate this, we employ fuzzing for pre-analysis to generate the parameters 
that restrict the scale for DSE. 


4 For example, the benchmark standard_copy2_ground-1.c 
    #define N 100000 
    int main() { 
      int a1[N], a2[N], a3[N], i; 
      for(i=0; i<N; i++) { 
        a1[i]=input(); a2[i]=input(); 
      } 
      for(i=0; i<N; i++) a3[i]=a1[i]; 
      for(i=0; i<N; i++) a3[i]=a2[i]; 
      for(i=0; i<N; i++) 
        /* ... */ 
      return 0; 
    } 

Fig. 2: standard_copy2_ground-1.c 

    Input Seed 
    Variable_0 = X;          (symbolic variables) 
    Variable_1 = X; 
    ... 
    Variable_99 = X; 
    Variable_100 = X;        (concrete variables) 
    Variable_101 = X; 
    ... 
    Variable_199998 = X; 
    Variable_199999 = X; 

Fig. 3: Selective Symbolization in FDSE 


— For programs that contain a large number of static branches (footnote 5), executing 
a single path to termination takes a long time, which hinders the overall efficiency 
of DSE. To tackle this problem, we use SSE instead of DSE to 
analyze such programs, as SSE can perform better in this scenario. 
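
The choice between the two engines described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not FDSE's code: the names `choose_engine` and `STATIC_BRANCH_THRESHOLD` are hypothetical, and only the threshold value of 10,000 static branches and the "exceeds" comparison come from the text.

```python
# Hypothetical sketch of the engine choice; names are illustrative, not FDSE's API.
STATIC_BRANCH_THRESHOLD = 10_000  # threshold used in the competition, per the text


def choose_engine(static_branches: int,
                  threshold: int = STATIC_BRANCH_THRESHOLD) -> str:
    """Return "SSE" if the static branch count exceeds the threshold, else "DSE"."""
    if static_branches < 0:
        raise ValueError("static_branches must be non-negative")
    return "SSE" if static_branches > threshold else "DSE"


if __name__ == "__main__":
    for count in (8, 10_000, 25_000):
        print(count, "->", choose_engine(count))
```

With eight static branches, as in the demonstration program, the sketch selects DSE.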

Demonstration. We use a Test-Comp benchmark program to demonstrate 
the fuzzing-based pre-analysis. Figure 2 shows an example program that contains 
four loops with a size of 100,000 and requires 200,000 input variables (i.e., 
symbolic variables). It is impractical for SE to explore the path space of this program. The 
key idea is to employ fuzzing first to generate seed inputs and then to symbolize only a part 
of the input variables during SE, which can improve efficiency while ensuring high 
coverage. Consider the program in Figure 2. The first step is to employ fuzzing 
to generate input seeds, as shown in Figure 3. Each seed contains 200,000 
variables, each with a random value X. Since only eight static branches exist, FDSE 
uses the DSE engine. During DSE, FDSE bounds each loop, 
allowing the loop body to be unrolled up to a configured number of times. This 
bound is determined by the information collected by fuzzing: FDSE unrolls 
the loop only 50 times if the fuzzer detects that the loop body is executed more 
than 100 times. Then, DSE reads the input seeds obtained from fuzzing. For this 
example, DSE symbolizes only the first 100 variables, because the input loop 
reads two variables per iteration and is unrolled 50 times. The remaining variables 
keep their concrete values. When generating 
test cases, the generated values of the symbolic variables are concatenated with the 
values of the subsequent concrete variables in the input seed. Thus, DSE can 
still generate a complete test case. 
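
The arithmetic of the demonstration can be replayed with a small Python sketch. It is a simplified illustration rather than FDSE's implementation: the function names are hypothetical, the seed is modeled as a plain list of integers, and the behaviour for loops that fuzzing saw at most 100 times (unrolling as observed) is an assumption, since the text only specifies the bound of 50 for loops executed more than 100 times.

```python
# Hypothetical sketch of selective symbolization; not FDSE's real data model.
from typing import List

UNROLL_BOUND = 50           # unrolling bound from the text
HOT_LOOP_ITERATIONS = 100   # fuzzing-observed count above which the bound applies


def unroll_limit(observed_iterations: int) -> int:
    """Bound unrolling at 50 when fuzzing saw more than 100 iterations.

    Returning the observed count otherwise is an assumption of this sketch.
    """
    if observed_iterations > HOT_LOOP_ITERATIONS:
        return UNROLL_BOUND
    return observed_iterations


def symbolic_prefix_length(observed_iterations: int, inputs_per_iteration: int) -> int:
    """Number of leading input variables that DSE treats as symbolic."""
    return unroll_limit(observed_iterations) * inputs_per_iteration


def build_test_case(seed: List[int], solved_prefix: List[int]) -> List[int]:
    """Concatenate solver-produced values with the remaining concrete seed values."""
    if len(solved_prefix) > len(seed):
        raise ValueError("more symbolic values than seed variables")
    return solved_prefix + seed[len(solved_prefix):]


if __name__ == "__main__":
    seed = [7] * 200_000                         # stands in for the random values X
    prefix = symbolic_prefix_length(100_000, 2)  # 50 unrollings, 2 inputs each
    case = build_test_case(seed, list(range(prefix)))
    print(prefix, len(case))
```

For the program in Figure 2, the sketch yields a symbolic prefix of 100 variables and a complete test case of 200,000 values.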


3 Result and Discussion 


FDSE is optimized for branch coverage and scored 5132 points (4th place) in the 
branch coverage track. Our tool performs well in many sub-categories, such as Arrays, 
Bit Vectors, and Hardness. Thanks to the Test-Comp competition, we have identified 


5 For example, the program Problem05_label40+token_ring.01.cil-1.c 
several shortcomings in our DSE engine beyond the common challenges (such as 
path explosion and constraint solving [2]). 


— Our DSE engine does not apply any simplification rule to reduce symbolic 
expressions, which results in redundant expressions and makes the tool crash 
on some Hardware benchmarks due to exceeding memory limits. 

— Our DSE engine is limited in environment modeling, e.g., the common sys- 
tem libraries. When programs call these system libraries, the relevant path 
constraints are lost, making it difficult to improve coverage, particularly in 
the tasks in BusyBox, DeviceDriverLinux64, and AWS-C-Common. 

— Our DSE engine is still limited in handling large symbolic arrays. Restricting 
the number of symbolic variables limits the path exploration ability, which 
may fail to cover deep branches. 

— We do not prioritize or minimize the generated tests, which results in redun- 
dant test cases and leads to validator timeout. For example, in the Combi- 
nations category, over 20% of tests were not executed. 

— FDSE is optimized only for the branch coverage track. Smarter SE search 
strategies for both branch and error coverage are still needed. 


4 Software Project and Data Available 


The DSE engine of FDSE is based on SymCC [8]. The SSE engine 
is KLEE [4]. The fuzzing component is implemented in C++ and based on 
LLVM [6] (footnote 6). The constraint solver used by the DSE engine is Z3 [7]. The 
command line interface is implemented in Python. 

In Test-Comp 2024, FDSE participated in the coverage-branches and coverage-error 
categories, although we optimized FDSE only for coverage-branches. The 
benchexec tool information module is fdse.py, and the benchmark description 
is fdse.xml. To use our tool script, the parameters of the property file, time 
budget, and benchmark path must be set as follows: 


fdse -testcomp -property-file=<..> -max-time=<..> -single-file-name=<..> 


Our symbolic execution engine treats each benchmark as running on a 64-bit 
architecture and always tries to maximize code coverage. The test suite generated 
is written to the directory fdse_output/test-suite. According to the definition 
of Test-Comp rules, the test suite includes a metadata XML file and a test-case 
XML file that follows the required format. 

FDSE, developed by the National University of Defense Technology, can be 
found at https://github.com/zbchen/fdse-test-comp. FDSE is also available for 
download as a binary artifact on Zenodo; the specific version is testcomp24 
(footnote 7), which is publicly accessible under the Apache-2.0 license. 
Moreover, Test-Comp 2024 [3] (footnote 8) provides users with scripts, benchmarks, 
and FDSE binaries to facilitate the replication of competition results. 


6 LLVM’s version is 10.0.1. 
7 https://doi.org/10.5281/zenodo.10203198 
8 https://test-comp.sosy-lab.org/2024 
Abstract. FIZZER is a new gray-box fuzzer. In contrast to common 
gray-box fuzzers that aim to cover both true and false branches of 
branching instructions, FIZZER primarily aims to cover both possible 
values true and false of Boolean expressions in the program. When a 
generated test evaluates a so-called atomic Boolean expression to one of 
these values, our fuzzer computes the distance to the other value, detects 
bytes that influence this distance, and applies gradient descent on these 
bytes to flip the value. In Test-Comp 2024, FIZZER placed third in the 
category Cover-Branches after FuSeBMC and FuSeBMC-AI. 


Keywords: gray-box fuzzing · dynamic analysis · gradient descent 


1 Test-Generation Approach 


Fuzzing [5] is an automatic technique that generates test inputs for a given 
program. Gray-box fuzzers first instrument the given program with code that 
tracks selected information about a program execution. The instrumented pro- 
gram is then repeatedly executed on various inputs and the tracked information 
is used to generate new inputs that should execute parts of the program not 
executed in previous runs. 

Successful gray-box fuzzers like AFL [6] collect only very limited information 
about each program execution and try to quickly perform as many executions as 
possible. In FIZZER, we use an approach that gathers slightly more information 
about program executions and uses it to select uncovered parts of the code and 
make more targeted attempts to cover it. 

While typical gray-box fuzzers track only the information about the basic 
blocks visited during a program execution, our approach also tracks the evaluation of 
each atomic Boolean expression (ABE). A Boolean expression is atomic if it is not 
a variable, not a call of a function whose definition is a part of the program, and 
not a result of applying a logical operator. Many LLVM instructions yielding i1 
type (i.e., Boolean) from other types are ABEs. An important example is the icmp 
instruction used in translations of C expressions like (x > 42) or (string[i] 
== ’A’). Each time an ABE is evaluated to true or false, the instrumented 


* This work has been supported by the Czech Science Foundation grant GA23-06506S. 
M. Trtik—Jury member. 
code saves the calling context (i.e., the sequence of currently evaluated function 
calls, which loosely corresponds to the call stack), the value of the ABE, and the 
distance to the opposite value. For example, if ABE (x > 42) is evaluated to 
true, the distance to false is computed as x - 42. 

Our fuzzer aims to generate tests that evaluate each ABE in each reached 
calling context to both true and false. Assume that some input leads to the 
evaluation of an ABE to true and we want to evaluate it to false in the same 
calling context. We first repeatedly execute the program on various mutations 
of the input to detect the bytes of this input that have some influence on the 
distance of the ABE evaluation. This process is called a sensitivity analysis and 
the detected bytes are called sensitive. Then we apply the following two analyses 
that use the sensitive bytes. One analysis performs a gradient descent on the 
sensitive bytes with the aim to minimize the absolute value of the distance and 
to evaluate the ABE in the considered calling context to false. Alternatively, if 
we already know another input evaluating the ABE to false in a different calling 
context, we can try to use the value of its sensitive bytes instead of the sensitive 
bytes of the current input. This analysis is called byteshare analysis. 
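
The interplay of distance, sensitivity analysis, and gradient descent can be illustrated with a toy Python sketch. It is a strong simplification and not FIZZER's implementation: a Python function `target` stands in for the instrumented program, all names are hypothetical, the gradient is estimated by finite differences on integer byte values, and the distance to true used for the false case (43 - x) is an assumption, since the text only gives the distance to false (x - 42).

```python
# Toy illustration: a Python function stands in for the instrumented TARGET.
import math
from typing import Callable, List, Optional, Tuple

Run = Callable[[bytes], Tuple[bool, int]]  # input -> (ABE value, distance)


def target(data: bytes) -> Tuple[bool, int]:
    """Evaluate the ABE (x > 42), where x is byte 2 of the input.

    If the ABE is true, the distance to false is x - 42 (as in the text).
    The distance to true, 43 - x, is an assumption made for this sketch.
    """
    x = data[2]
    return (True, x - 42) if x > 42 else (False, 43 - x)


def sensitive_bytes(run: Run, data: bytes) -> List[int]:
    """Sensitivity analysis: indices whose mutation changes the distance."""
    _, base = run(data)
    found = []
    for i in range(len(data)):
        for delta in (1, 255):  # +1 and -1 modulo 256
            mutated = bytearray(data)
            mutated[i] = (mutated[i] + delta) % 256
            if run(bytes(mutated))[1] != base:
                found.append(i)
                break
    return found


def flip_by_descent(run: Run, data: bytes, max_steps: int = 32) -> Optional[bytes]:
    """Move sensitive bytes against a finite-difference gradient until the ABE flips."""
    start_value, _ = run(data)
    indices = sensitive_bytes(run, data)
    current = bytearray(data)
    for _ in range(max_steps):
        value, dist = run(bytes(current))
        if value != start_value:
            return bytes(current)
        grad = {}
        for i in indices:
            step = 1 if current[i] < 255 else -1
            probe = bytearray(current)
            probe[i] += step
            probe_value, probe_dist = run(bytes(probe))
            if probe_value != start_value:
                return bytes(probe)  # the probe itself already flips the ABE
            grad[i] = (probe_dist - dist) / step
        norm = sum(g * g for g in grad.values())
        if norm == 0:
            return None  # flat region: this simple descent gives up
        for i, g in grad.items():
            move = dist * g / norm
            move = int(math.copysign(math.ceil(abs(move)), move))
            current[i] = max(0, min(255, current[i] - move))
    return bytes(current) if run(bytes(current))[0] != start_value else None


if __name__ == "__main__":
    seed = bytes([0, 0, 100, 0])
    print(sensitive_bytes(target, seed), flip_by_descent(target, seed))
```

In this toy setting only byte 2 is sensitive, and a single descent step moves it from 100 to 42, which evaluates the ABE to false.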

The fuzzer maintains the information about ABEs evaluated in all program 
executions, their calling contexts, values, and distances in a binary tree called 
atomic Boolean execution tree. The tree is used to select the ABE and its value 
to be covered. 

For a more detailed and formal description of our approach, we refer to the 
corresponding research paper [4]. 


2 Software Architecture 


FIZZER is implemented in C++, consists of around 11,000 lines of code in 125 
files and uses the LLVM infrastructure. The compiled tool is dependent only on 
the CLANG compiler. FIZZER consists of two 64-bit executables, namely SERVER 
and INSTRUMENTER, and a collection of static LIBRARIES provided in both 32- 
bit and 64-bit versions. Finally, there is a Python script offering a user friendly 
interface to the tool. 

The input program is first translated to LLVM by CLANG. The INSTRU- 
MENTER then instruments the LLVM program with the code for tracking and 
collecting data during program execution, as explained in the previous section. 
The inserted code calls functions from the static LIBRARIES. The instrumented 
program linked with the corresponding static LIBRARIES is called TARGET. 

The SERVER controls the actual test generation process. In particular, SERVER 
generates inputs using the sensitivity analysis, gradient descent, and byteshare 
analysis mentioned above and runs the TARGET on these inputs. It also receives 
and processes the information tracked by the TARGET during its executions and 
builds the atomic Boolean execution tree. The tree is used to select an ABE value 
to be covered. 

The SERVER is one process and each execution of TARGET runs in another 
process. The exchange of information between the SERVER’s process and the 
TARGET’s process is done via shared memory. This ensures that the SERVER can 
receive the information about TARGET’s execution even if the execution crashes. 


3 Strengths and Weaknesses 


On the positive side, FIZZER is a relatively simple and very compact tool with 
minimal external dependencies. As it is a pure fuzzer, it can be applied to pro- 
grams of an arbitrary size and it can also handle programs that use external 
functions available only in compiled form. Moreover, covering (in)equality constraints, 
which is often difficult for fuzzers, is boosted by gradient descent. 

Fuzzers in general limit each execution of the program as they need to per- 
form many of these executions. FIZZER sets upper bounds (passed to the tool via 
command line options) on the number of evaluated ABEs, the size of the input 
bytes read, the size of the calling context, and other properties. If an execution 
of the TARGET exceeds some of the bounds, it is terminated. FIZZER thus obtains 
information only about prefixes of real executions, so it can effectively 
generate tests only for parts of the program close to the program entry point. 
This weakness matches the well-known practical experience with fuzzers 
in general: they are effective in covering code close to the entry point, but have 
trouble getting deeper. In FIZZER, we do not attempt to deal with this 
phenomenon properly. We only use a so-called optimizer after fuzzing stops (usually due to 
reaching its timeout). The optimizer simply sets the upper bounds to large 
numbers and executes the program on those generated inputs that exceeded 
some upper bound during fuzzing. 

Some weaknesses of FIZZER also come from the fact that it is only a prototype 
implementation taking advantage of some specific features of the Test-Comp 
benchmarks. In particular, the only way of reading input currently supported 
by FIZZER is via the functions __VERIFIER_nondet_*(). 

Another weakness is related to the use of gradient descent as one of the 
main techniques to cover a selected ABE. The technique is efficient when flipping 
Boolean values depending on functions with only few extremes (e.g., quadratic 
functions), but it can struggle on functions with complex behavior (e.g., functions 
used for hashing). To mitigate this issue, we implemented a second version 
of the gradient descent adjusted for functions with many local extremes and we 
apply it, e.g., to the XOR function. 

In Test-Comp 2024, FIZZER won the bronze medal in the category Cover- 
Branches where 18 tools were competing. Moreover, it obtained the highest score 
in 3 out of 23 sub-categories of Cover-Branches, namely in ReachSafety-Floats, 
SoftwareSystems-AWS-C-Common-ReachSafety, and SoftwareSystems-BusyBox- 
MemSafety. FIZZER also participated in the Cover-Error category. It is impor- 
tant to stress that FIZZER cannot currently be instructed to focus on covering 
one particular location, like the target reach_error() of this category. FIZZER 
thus attempted to cover all ABEs in the program, just like in the other category. 
Despite that, FIZZER placed seventh out of 19 participants in this category. 
More details can be found on the competition’s website [1] and in the report [2]. 
4 Tool Setup and Configuration 


FIZZER can be downloaded either as a binary or as source code (links are 
in Section 6). For the source code, check out the commit tagged TESTCOMP24 in 
order to build the version that participated in the competition. The README.md file 
in the root of the repository contains detailed instructions for building the tool. 
Once the tool is built, all binaries are under the ./dist directory. The content of 
the directory can be copied “as-is” to a target computer, i.e., no installation is 
necessary. The tool should be used via the sbt-fizzer.py script: 


sbt-fizzer.py [options] --input_file <my-c-program> 
--output_dir <my-output-dir> 


All results for the given C program <my-c-program> will be stored under the 
directory <my-output-dir> (including generated tests). The list of all available 
options can be obtained with the command sbt-fizzer.py --help. Here are the 
options we used in the competition: 


• max_seconds 865: The timeout for the fuzzing. 
• optimizer_max_seconds 30: The timeout for the optimizer. 
• max_stdin_bytes 65536: The upper bound for the number of input bytes. 
• stdin_model stdin_replay_bytes_then_repeat_zero: An input model. 
  Read generated input bytes and then read zeros. 
• test_type testcomp: The format for the generated tests. 
• max_exec_milliseconds 500: The timeout for each TARGET’s execution. 

Please note that FIZZER currently does not execute the given program in an 
isolated environment. It is thus not advised to run FIZZER directly (outside a 
container) on any C program accessing disk or other external resources. 


5 Software Project and Contributors 


FIZZER has been developed at the Faculty of Informatics of Masaryk University 
by Marek Trtik and Lukáš Urban. Martin Jonáš and Jan Strejček participated in 
discussions and contributed ideas to the project. The tool is open-source 
and it is available under the ZLIB license. 


6 Data-Availability statement 


FIZZER is available in a binary form at Zenodo [3] and the source code is available 
at GitHub: 


https://github.com/staticafi/sbt-fizzer 
Abstract. KLEEF is a complete overhaul of the KLEE symbolic execution 
engine for LLVM, fine-tuned for a robust analysis of industrial 
C/C++ code. KLEEF natively handles complex data structures, such 
as trees, linked lists, and dynamically allocated arrays, via lazy initializa- 
tion and symcrete values. KLEEF has fine-tuned modes for both maxi- 
mal test coverage generation and reproducing error traces, in particular 
reaching a specific point in the program. In the paper, we describe the 
above features and a competition configuration of KLEEF. 


Keywords: Symbolic Execution · Lazy Initialization · KLEE Fork. 


1 Test-Generation Approach 


KLEEF is a complete overhaul of the KLEE [1,4] symbolic execution engine. 
We first describe how KLEE works, then we describe our enhancements over it. 


1.1 Symbolic Execution in KLEE 


As a symbolic interpreter [1], KLEE runs a program on a symbolic memory, 
which maps program locations to symbolic values representing sets of concrete 
values. When it meets a branching instruction, it adds the target instructions to a 
queue, and after each executed instruction it decides which instruction to execute 
next. A symbolic interpreter collects all conditions from branching instructions in 
a path constraint. This formula is either unsatisfiable (if the path 
is infeasible) or satisfiable, possibly with multiple solutions. Each solution gives a 
concrete test that visits the corresponding path. A symbolic interpreter 
usually relies on an SMT solver (like Z3 [8]) to obtain solutions of path constraints. 

The KLEE engine is split into two logical parts. The first part is a symbolic 
interpreter, which takes a symbolic state, executes one instruction, and produces 
new states. The second part is a searcher, which chooses the next symbolic state 
to execute according to a strategy, specified by input options, e.g., BFS or DFS. 
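
The split between interpreter and searcher can be illustrated with a tiny Python sketch. It is a toy model under stated assumptions and not KLEE's code: the "program" is a hypothetical list of two branch conditions over one integer input, a state is just a path prefix, and a brute-force search over a small domain stands in for the SMT solver.

```python
# Toy model of the interpreter/searcher split; all names are hypothetical.
from collections import deque
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Cond = Callable[[int], bool]
Path = Tuple[Tuple[int, bool], ...]  # (branch index, branch taken?) pairs

# The "program": two consecutive branches over a single symbolic input x.
BRANCHES: List[Cond] = [lambda x: x > 42, lambda x: x % 2 == 0]


@dataclass(frozen=True)
class State:
    pc: int      # index of the next branch to execute
    path: Path   # path constraint collected so far


def solve(path: Path, domain=range(-128, 128)) -> Optional[int]:
    """Brute-force stand-in for an SMT solver: first x satisfying the path."""
    for x in domain:
        if all(BRANCHES[i](x) == taken for i, taken in path):
            return x
    return None


def step(state: State) -> List[State]:
    """Interpreter: execute one branching instruction and fork feasible states."""
    successors = []
    for taken in (True, False):
        nxt = State(state.pc + 1, state.path + ((state.pc, taken),))
        if solve(nxt.path) is not None:  # drop infeasible paths
            successors.append(nxt)
    return successors


def explore(strategy: str = "bfs") -> List[int]:
    """Searcher: pick the next state in BFS or DFS order; return one test per path."""
    worklist = deque([State(0, ())])
    tests = []
    while worklist:
        state = worklist.popleft() if strategy == "bfs" else worklist.pop()
        if state.pc == len(BRANCHES):
            tests.append(solve(state.path))
            continue
        worklist.extend(step(state))
    return tests


if __name__ == "__main__":
    print(explore("bfs"), explore("dfs"))
```

Both strategies cover the same four paths; only the order in which the tests are produced differs.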
1.2 Our Enhancements over KLEE 


We enhanced KLEE with support for arbitrary data structures such as trees 
and linked lists by implementing lazy initialization [7]. If KLEE dereferences 
a symbolic pointer, it forks the symbolic state into many: each one assumes that 
the pointer refers to one of the existing locations in the memory. In KLEEF 
we also fork one extra state, where the pointer refers to a fresh, lazily initialized 
symbolic object, which is distinct from all other objects in the current symbolic 
memory. If there are not enough objects in the memory, KLEEF creates a 
new one and continues execution, whereas KLEE does not. In the configuration used 
at the competition, we only create lazily initialized symbolic objects for symbolic 
pointers, without first forking the state into existing locations. 

We improve KLEE with symcretes [10], which help to support dynamically 
allocated arrays (with symbolic sizes) and external calls. KLEEF thus supports 
detecting buffer overflows. A symcrete is a pair of a symbolic value and a concrete 
instance of it that is valid in the current context. The concrete part of a symcrete is 
derived from a model of the path constraint. It stays the same as long as the solver can 
find a model for the concretized constraints. If the solver fails, the concretization 
is updated with values from a model of the original constraints. When a logical 
solver receives a query with a symcrete, an equality between the symbolic and 
concrete parts of the symcrete is added to the query. This helps the solver 
to solve the query, as a part of the model is already specified in the symcrete. 
KLEEF thus handles dynamically allocated arrays by making the array size and 
address symcretes. KLEEF uses the solver to minimize the possible array size and 
uses sparse storage for arrays, so that the entire process does not blow up. 
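
The update rule for symcretes can be sketched as follows. This is a minimal illustration, not KLEEF's implementation: the class `Symcrete`, the functions `find_model` and `check`, and the use of a brute-force search over a small integer domain in place of an SMT solver are all assumptions made for the example.

```python
# Hypothetical sketch of symcrete handling; a brute-force search replaces the solver.
from dataclasses import dataclass
from typing import Callable, List, Optional

Constraint = Callable[[int], bool]


def find_model(constraints: List[Constraint], domain=range(0, 1024)) -> Optional[int]:
    """Return the smallest value in the domain satisfying all constraints, if any."""
    for value in domain:
        if all(c(value) for c in constraints):
            return value
    return None


@dataclass
class Symcrete:
    name: str      # symbolic part (here just a variable name)
    concrete: int  # concrete instance valid in the current context


def check(symcrete: Symcrete, constraints: List[Constraint]) -> bool:
    """Check satisfiability, keeping the concretization whenever possible."""
    # First try the query with the equality "symbolic == concrete" added.
    pinned = list(constraints) + [lambda v: v == symcrete.concrete]
    if find_model(pinned) is not None:
        return True
    # Otherwise fall back to the original constraints and update the concretization.
    model = find_model(constraints)
    if model is None:
        return False
    symcrete.concrete = model
    return True


if __name__ == "__main__":
    size = Symcrete("n", 0)
    print(check(size, [lambda n: n >= 0]), size.concrete)
    print(check(size, [lambda n: n >= 0, lambda n: n > 8]), size.concrete)
```

Because the stand-in solver scans the domain in ascending order, the updated concretization is the smallest admissible value, which loosely mirrors the idea of keeping array sizes small.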

We have implemented searchers optimized specifically for maximizing coverage 
and reaching the error target. That is, KLEEF has a targeted searcher and 
a guided searcher, which maximize coverage and error reachability, similar to [8]. 
The targeted searcher uses a shortest-path-based algorithm to choose the execution 
state nearest to the target location. Each execution state carries a set of 
targets. A guided searcher manages a set of targeted searchers with different 
targets and chooses states from every targeted searcher in an interleaved manner. 
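
A minimal Python sketch of the two searchers follows. It is an illustration under simplifying assumptions, not KLEEF's code: execution states are represented only by their current location in a small hypothetical control-flow graph, distances are computed by breadth-first search, and the guided searcher interleaves its targeted searchers in round-robin order.

```python
# Hypothetical sketch of targeted and guided searchers on a toy CFG.
from collections import deque
from typing import Dict, List, Optional, Tuple

Cfg = Dict[str, List[str]]

CFG: Cfg = {"entry": ["a", "b"], "a": ["c"], "b": ["err"],
            "c": ["exit"], "err": [], "exit": []}


def distance(cfg: Cfg, src: str, dst: str) -> Optional[int]:
    """Length of the shortest path from src to dst, or None if unreachable."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in cfg[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


class TargetedSearcher:
    """Prefers the state whose location is nearest to one target."""

    def __init__(self, cfg: Cfg, target: str) -> None:
        self.cfg, self.target = cfg, target
        self.states: List[str] = []

    def add(self, location: str) -> None:
        if distance(self.cfg, location, self.target) is not None:
            self.states.append(location)  # keep only states that can reach the target

    def select(self) -> Optional[str]:
        if not self.states:
            return None
        return min(self.states, key=lambda s: distance(self.cfg, s, self.target))


class GuidedSearcher:
    """Interleaves several targeted searchers in round-robin order."""

    def __init__(self, cfg: Cfg, targets: List[str]) -> None:
        self.searchers = [TargetedSearcher(cfg, t) for t in targets]
        self.turn = 0

    def add(self, location: str) -> None:
        for searcher in self.searchers:
            searcher.add(location)

    def select(self) -> Optional[Tuple[str, str]]:
        for _ in range(len(self.searchers)):
            searcher = self.searchers[self.turn]
            self.turn = (self.turn + 1) % len(self.searchers)
            choice = searcher.select()
            if choice is not None:
                return searcher.target, choice
        return None


if __name__ == "__main__":
    guided = GuidedSearcher(CFG, ["err", "exit"])
    for location in ("a", "b"):
        guided.add(location)
    print(guided.select(), guided.select(), guided.select())
```

In this example the state at `b` is chosen for the target `err` and the state at `a` for the target `exit`, alternating between the two targets.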

KLEEF improves over KLEE in constraint solving by caching unsatisfiability 
cores, interning symbolic expressions, tracking constraints during simplification 
to detect conflicts, and using an SMT solver incrementally. In KLEEF we 
added support for the BITWUZLA [9] SMT solver, which performs significantly better 
on TEST-COMP benchmarks. For example, KLEEF with Z3 achieves 2430 
points running for 30 seconds on TEST-COMP 2023 benchmarks, while KLEEF 
with BITWUZLA achieves 2560 points within the same time limit. 


2 Architecture 


KLEEF has the same architecture as KLEE [4]. KLEEF is implemented in 
C/C++ and relies on the LLVM infrastructure. KLEEF supports STP [5], 
Z3 [8] and BITWUZLA [9] SMT solvers for checking constraint satisfiability. 
3 Strengths and Weaknesses of the Approach 


KLEEF took 3rd place in TEST-COMP 2024 (Overall) [2], which is impressive 
as it is a pure symbolic execution engine. That is, it could get even better results 
if paired with fuzzing or other techniques. 

The main reasons for our advancement in the coverage category are as follows. 
First, a smart searcher guides the symbolic execution towards 
uncovered branches. Second, constraint solving is fast, incorporating a number 
of caching techniques and solver incrementality. Third, the engine handles 
allocations with a symbolic size without concretization by using symcrete values. 

The main reasons for our advancement in the error reaching category include 
a smart searcher guiding the execution towards an error and the elimination 
of syntactically unreachable paths in the CFG. 

Note that KLEEF scored fewer points than KLEE in the error reaching category. 
KLEEF solved more benchmarks, yet this number is normalized 
across subcategories. As KLEEF solves fewer benchmarks than KLEE in the 
SoftwareSystems-BusyBox-MemSafety and SoftwareSystems-OpenBSD-MemSafety 
subcategories, we got fewer points in the error reaching category in total. The poor 
performance in these two subcategories is due to bugs in KLEEF: it generated 
a few tests which were not reproduced by the validation system. 


4 Tool Setup and Configuration 


4.1 How to Use KLEEF 


In order to run the competition version from the command line, one should 
get the archive with binaries from Zenodo (footnote 1) and follow the README inside. 

In order to generate test coverage for a project without configuring 
KLEEF manually, one should use the user-friendly wrapper UNITTESTBOT 
C/C++ [6,12]. It allows KLEEF to be run in VS Code and JetBrains CLion. 

In order to build KLEEF from sources, one should install LLVM, clone 
KLEEF from GitHub (footnote 2), and run the build.sh script in the repository root. 


4.2 Competition Configuration 


KLEEF participates in both Cover-Error and Cover-Branches categories. 

Common Parameters. Parameters --strip-unwanted-calls, 
--delete-dead-loops=false, --mock-all-externals are used to (de)activate necessary 
LLVM passes to simplify bitcode for symbolic execution. A parameter 
--external-calls=all allows function calls with symbolic arguments. An option 
--libc=klee makes KLEEF support an extended number of external functions. 
Parameters --cex-cache-validity-cores, --use-forked-solver=false, 
--solver-backend=bitwuzla-tree, --max-solvers-approx-tree-inc=16 are 
used to cache unsatisfiability cores and call the BITWUZLA solver incrementally. 


1 https://doi.org/10.5281/zenodo.10202734 
2 https://github.com/UnitTestBot/klee 
Parameters --symbolic-allocation-threshold=8192, --skip-not-lazy-initialized, 
--use-sym-size-alloc are used to tune lazy initialization and 
dynamically allocated arrays. 

A parameter --fp-runtime adds floating-point support. Parameters starting 
with --allocate-determ activate X86 support. An option --x86FP-as-x87FP80 
adds emulation of X86 floating points as extended 80-bit floating points. 

Finally, --max-memory and --max-time set the memory and time limits. 

Parameters for Cover-Error. An option --optimize=true simplifies code 
before execution, e.g., it merges some branches over multiple blocks into select 
instructions. Options --search=dfs --search=bfs make KLEEF interleave 
between DFS and BFS. Options --function-call-reproduce=reach_error, 
--exit-on-error-type=Assert make KLEEF run towards the reach_error function 
and fail only there. An option --dump-states-on-halt=unreached permits 
KLEEF to generate tests for unfinished paths. 

Parameters for Cover-Branches. A parameter --track-coverage=all 
makes KLEEF track coverage by both branches and instructions. Options 
--optimize=false and --optimize-aggressive=false disable optimizations 
which decrease coverage. Options --use-iterative-deepening-search=max-cycles, 
--max-cycles-before-stuck=15 activate an iterative-deepening mode 
of execution on the number of executed loop cycles. A parameter 
--max-solver-time=10s sets a time limit for the SMT solver. An option 
--only-output-states-covering-new makes KLEEF only generate tests which 
increase coverage. Options --search=dfs, --search=random-state make KLEEF 
interleave between DFS and taking a random state. A parameter 
--dump-states-on-halt=all makes KLEEF generate tests for the symbolic states 
remaining at the end. Options --cover-on-the-fly, --delay-cover-on-the-fly, 
--mem-trigger-cof start on-the-fly test generation when the memory cap is approached. 


5 Software Project and Contributors 


More information about KLEEF is available on its website (footnote 3). KLEEF is an 
open-source piece of software which you could contribute to at GitHub (footnote 4). 

The key developers are the authors of this paper affiliated with RnD Toolchain 
Labs, Huawei, Shenzhen, China. The authors have decent experience in the im- 
plementation of research and industrial symbolic execution engines. 


6 Data-Availability Statement 


A binary version of KLEEF participating in the competition is publicly available 
(footnote 5). Also, its source code is available on GitHub (footnote 4). 


3 https://toolchain-labs.com/projects/kleef.html 
4 https://github.com/UnitTestBot/klee 
5 https://doi.org/10.5281/zenodo.10202734 
Abstract. Dynamic Symbolic Execution (DSE) is an important method 
for the testing of programs. The major advantage of DSE is its path-by- 
path exploration of the program execution space. However, this often 
leads to the path explosion problem. To address this issue, a method of 
abstraction learning has been used. The key step here is the computa- 
tion of an interpolant to represent the learned abstraction. In Test-Comp 
2024, we use two different approaches to interpolant generation, viz., Deletion 
Interpolation and Weakest Precondition Interpolation. The former 
is our more stable and mature system and is briefly discussed in [8]. In 
this paper, we present the latter approach which is the heart of TracerX. 
In general, the Weakest Precondition (WP) is the ideal (most general) 
interpolant. However, WP is intractable to compute and is exponentially 
disjunctive. A major challenge is to obtain a conjunctive approximation 
of the WP. Therefore, we generate an approximation of the WP. 


Keywords: Dynamic Symbolic Execution, Interpolation, Weakest Pre- 
condition 


1 Test-Generation Approach 


DSE is an important method for program testing. The main challenge in symbolic 
execution (SE) is path explosion. The method of abstraction learning [10] has 
been used to address this by generating the interpolants to represent the learned 
abstraction. The core feature in abstraction learning is the subsumption of paths 
whose traversals are deemed to no longer be necessary due to similarity with 
already-traversed paths. Despite the overhead of computing interpolants, the 
pruning of the symbolic execution tree (SET) that interpolants provide often 
brings significant overall benefits. An interpolant of a program point (state) is 
an abstraction of it which ensures the safety of the subtree rooted at that state. 
Thus, upon encountering another state of the same program point, if the context 
of the state implies the interpolant formula, then continuing the execution from 
the new state will not lead to any error. Consequently, we can prune the subtree 
rooted in the new state [6,7]. 
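
The subsumption test described above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the TracerX implementation: formulas are modeled as conjunctions of upper bounds on integer variables, so that implication reduces to comparing bounds, whereas a real engine would discharge the implication with an SMT solver. The names `implies` and `can_prune` are hypothetical.

```python
# Hypothetical model: a formula is a conjunction of constraints "var <= bound",
# stored as a dict {var: bound}. A variable that is absent is unconstrained.

def implies(context, interpolant):
    """Return True if the context formula implies the interpolant formula."""
    for var, bound in interpolant.items():
        # The context must bound var at least as tightly as the interpolant.
        if var not in context or context[var] > bound:
            return False
    return True


def can_prune(program_point, context, interpolants):
    """A new state may be pruned if an interpolant stored for the same
    program point is implied by the context of the state."""
    stored = interpolants.get(program_point)
    return stored is not None and implies(context, stored)


# Interpolant learned for program point "L3" from an already explored subtree.
interpolants = {"L3": {"x": 7}}

print(can_prune("L3", {"x": 5, "y": 0}, interpolants))  # True: x <= 5 implies x <= 7
print(can_prune("L3", {"x": 9}, interpolants))          # False: x <= 9 does not imply x <= 7
```

The more general the stored interpolant (here, the larger the bound), the more contexts imply it, and the more subtrees can be pruned.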

The heart of TracerX is the use of interpolation to address the path explosion
problem in DSE. This idea was first implemented in the TRACER system [9]. While
TRACER was able to perform bounded verification and testing on many examples, it
could not accommodate industrial programs, which often dynamically manipulate heap
memory. TracerX combines the state-of-the-art DSE technology used in KLEE
[5] with the pruning technology of TRACER to address this issue. We presented
the software architecture of TracerX in [8]. The default interpolation algorithm
used by TracerX is Deletion Interpolation, which was first developed in
TRACER [9].

Since the last Test-Comp, we have designed another interpolation algorithm,
namely Weakest Precondition (WP) interpolation. The Deletion algorithm generates
an interpolant as a subset of the incoming context (which is the strongest
postcondition on the path to the assume condition), while the WP algorithm generates
interpolants from the weakest precondition of a path in the program. Hence, the
WP interpolation algorithm provides a more general interpolant, which has
a higher chance of subsuming more subtrees in the SET.

The ideal (most general) interpolant is the WP of the target, which is the 
condition that must be satisfied in order to get the target satisfied. For example, 
consider the following piece of code: 


assume (not (b1 ∧ ¬b2 ∧ ¬b3))
if (b1) x += 3 else x += 2
if (b2) x += 5 else x += 7
if (b3) x += 9 else x += 14
{x <= 24}


The WP before the first if-statement is:

$b_1 \rightarrow (\neg b_2 \wedge b_3 \wedge x \le 7) \vee (b_2 \wedge x \le 4)$
$\neg b_1 \rightarrow x \le 3$


Here, the WP is expressed as a disjunction of two conditions. This means that either
of the two conditions can be satisfied for the target to be reached.
Unfortunately, the exact WP is intractable to compute: it is disjunctive, and the
number of disjuncts can grow exponentially with the number of branches. One way to
approximate the WP is to use a conjunctive approximation, which expresses the WP as
a conjunction of simpler conditions. This makes the computation tractable, but it
may reduce the quality of the interpolants, because the result under-approximates
the WP. However, this does not affect the soundness of the tool.


1.1 TracerX-WP: Approximation of Weakest Precondition 


TracerX-WP implements the algorithm which approximates the ideal WP by 
defining two components: path interpolants and tree interpolants. In this section, 
we briefly explain how these two components are computed and used to generate 
an approximation of the weakest precondition. 

A path interpolant is a formula that represents the WP of a path. Its computation
starts from the end of the path (the target formula) and works backward to the
beginning of the path, computing at each instruction a formula that, if satisfied
before the instruction, ensures that the target formula is satisfied after it. We
consider a path to be a sequence of assignments and assume statements executed in
a specific order.

An assignment instruction assigns a value to a variable. The interpolant of an
assignment instruction is a logical formula that describes the effect of the
assignment on the target. For example, given the assignment instruction “x := x + 2”
and the target “x < 15”, the interpolant is WP(inst, target) : x < 13.
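
The backward step for an assignment can be illustrated with the textbook substitution rule WP(x := e, Q) = Q[e/x]. The sketch below is only an illustration of that rule for the assignment x := x + 2 and the target x < 15; it models predicates and expressions as Python functions over a state dictionary, and the helper name `wp_assign` is hypothetical and not part of TracerX.

```python
def wp_assign(var, expr, target):
    """WP(var := expr, target): the target evaluated in the state that
    results from executing the assignment."""
    def precondition(state):
        updated = dict(state)
        updated[var] = expr(state)   # effect of the assignment
        return target(updated)       # target checked after the assignment
    return precondition


target = lambda s: s["x"] < 15                       # target: x < 15
pre = wp_assign("x", lambda s: s["x"] + 2, target)   # instruction: x := x + 2

# The computed precondition agrees with x < 13 on a range of integers.
assert all(pre({"x": v}) == (v < 13) for v in range(-50, 50))
print(pre({"x": 12}), pre({"x": 13}))  # True False
```

Applying this step repeatedly from the last instruction of a path back to its first yields the path interpolant for a path that consists of assignments only.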

For an assume instruction assume(B), consider the incoming context $\{C\}$ as the
precondition and $\{\omega\}$ as the target. An interpolant is a formula that represents
the logical relationship between the variables in the context $\{C\}$ and the
conditions in $B$. To find the interpolant, we compute the coarse partition (minimum
number of partitions) of $\{C\}$ such that $var(C_i) * var(C_j)$ s.t. $i \neq j$ ($*$ is
intuitively the “separating conjunction” from separation logic [12]), as shown in
Eq. (1):


$$\{C_1 * C_2 * C_3 * \dots * C_n\}\ \ \mathrm{assume}(B)\ \ \{\omega_1 * \omega_2 * \omega_3 * \dots * \omega_m\} \qquad (1)$$


We partition the $C_i$ into three groups. Constraints are replaced using the rules below:


– Target independent: The $C_i$ which are separate from $B$ and $\omega$.
  Action: Replace $C_i$ with true, i.e., remove $C_i$.

– Guard independent: Consider $C_{gi} = C_i$ s.t. $C_i * B$, and $\omega_{gi} = \omega_j$ s.t.
  $B * \omega_j$.
  Action: Replace $C_{gi}$ by $\omega_{gi}$.

– Remainder of the $C_i$: We do not capture the exact WP for this group,
  e.g., {z == 5} assume(x > z - 2) {x > 0} (here, $z \ge 2$ is the WP).
  Action: No change to $C_j$, i.e., keep $C_j$.
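
The three rules can be read as a classification of the context constraints by the variables they share with the guard and the target. The sketch below is one possible reading under stated assumptions, not the TracerX-WP implementation: a constraint is modeled as a hypothetical (text, variables) pair, "separate" is taken to mean "shares no variable", and every guard-independent context constraint is replaced by the target constraints that do not mention guard variables.

```python
def separate(constraint, others):
    """True if constraint shares no variable with any constraint in others.
    A constraint is modeled as a (text, variables) pair."""
    other_vars = set().union(*(variables for _, variables in others))
    return constraint[1].isdisjoint(other_vars)


def assume_interpolant(context, guard, targets):
    """Apply the three replacement rules to the context of assume(guard)."""
    # Target constraints that do not mention any guard variable.
    guard_free_targets = [w for w in targets if separate(w, [guard])]
    kept = []
    for c in context:
        if separate(c, [guard] + targets):
            continue      # target independent: replace by true (drop)
        if separate(c, [guard]):
            continue      # guard independent: replaced by guard-free targets
        kept.append(c)    # remainder: keep unchanged
    return [text for text, _ in kept + guard_free_targets]


context = [("y == 1", {"y"}), ("w >= 3", {"w"}), ("z == 5", {"z"})]
guard = ("x > z - 2", {"x", "z"})
targets = [("x > 0", {"x"}), ("w > 0", {"w"})]

print(assume_interpolant(context, guard, targets))  # ['z == 5', 'w > 0']
```

In this example, y == 1 is dropped, w >= 3 is generalized to the target w > 0, and z == 5 is kept as it is, because the exact WP of the remainder group is not computed.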


A tree interpolant is a formula that corresponds to all the branches of a sub- 
tree within the SET. It is computed as the conjunction of the path interpolants 
between the root of the tree and each leaf node. Tree interpolants can be used to 
prove the correctness of subtrees in the SET, by showing that a certain property 
holds for all possible paths or branches in the subtree. 
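
As a small illustration of the conjunction step, the sketch below combines path interpolants that are modeled, purely for the sake of the example, as upper bounds on integer variables. Under that assumption the conjunction keeps the tightest bound per variable. The function name `conjoin` is hypothetical.

```python
def conjoin(path_interpolants):
    """Conjunction of formulas of the form {var: upper_bound}:
    for every variable, the tightest (smallest) bound is kept."""
    tree_interpolant = {}
    for interpolant in path_interpolants:
        for var, bound in interpolant.items():
            tree_interpolant[var] = min(bound, tree_interpolant.get(var, bound))
    return tree_interpolant


# Path interpolants of three paths from the root of a subtree to its leaves.
paths = [{"x": 8}, {"x": 3}, {"x": 6, "y": 10}]
print(conjoin(paths))  # {'x': 3, 'y': 10}
```

The result is at least as strong as every path interpolant, so a state that implies the tree interpolant is safe on all paths of the subtree.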


2 Software Architecture


The software architecture of TracerX-WP is presented in Fig. 1. The core feature
of TracerX-WP is its interpolation engine, which generalizes the context of a node.
TracerX-WP works at the level of LLVM bitcode, the intermediate language of the
widely used LLVM compiler infrastructure [11]. It provides an interpreter that can
execute almost arbitrary code represented in LLVM IR, both concretely and symbolically.
TracerX-WP has a modular and extensible architecture. It provides a variety of
different search heuristics (e.g., Random and DFS) to explore the program state
space.


[Figure: components labeled KLEE, TracerX-WP Interpolant Generation Engine, Test Cases, Annotations, Statistics]

Fig. 1. TracerX-WP Framework


3 Strengths and Weaknesses


In Test-Comp 2024 [4], we participated with two different approaches to pruning
subtrees, viz. Deletion Interpolation and WP Interpolation. We refer to the
former system as TracerX and to the latter as TracerX-WP. TracerX secured a
score of 4020 on the 11042 tasks, with a CPU time of 694.44 hours and a wall
time of 722.22 hours. TracerX-WP obtained a score of 1480 on the same 11042
tasks, with CPU time and wall time both equal to 472.22 hours. The memory used by
TracerX and TracerX-WP was 19 TB and 10 TB, respectively. The total coverage obtained
by TracerX and TracerX-WP was 402000 and 148000, respectively, for the 11042 tasks.

The major reason for the lower score of TracerX-WP is that its implementation
is experimental. It crashed because some expression types are not supported
during interpolant computation. Also, in TracerX-WP,
test cases with the ‘.ktest’ extension are converted into the ‘.xml’ format only after
the symbolic execution engine has finished the exploration, while TracerX generates
the tests during the exploration. As a result, no test cases were available
for the coverage computation of programs that ran into a timeout.
Moreover, the configuration we used in the ‘BenchExec’ tool-info for TracerX-WP
lacked support for the 64-bit architecture. As a result, TracerX-WP was
not able to run the tests in some categories, such as ReachSafety-Hardware and
SoftwareSystems-BusyBox-MemSafety. The fixes for the above-mentioned issues
are conceptually straightforward but require a substantial amount of work, since
we need to modify the data structures used in our system. In subsequent versions,
we will provide a more stable system with all fixes and additional features.

In a comparison of TracerX with Symbiotic and Fizzer, which won bronze
(third place) in the Cover-Error and Cover-Branches tracks, respectively, TracerX
has almost equal scores in 13 out of 16 categories (with a difference of at most 3
tasks) and in 15 out of 23 categories, respectively. TracerX has better results than
Fizzer in some categories, such as ReachSafety-BitVectors, ReachSafety-Hardware, and
ReachSafety-Combinations. These observations show the potential of the TracerX
approach, and we hope to achieve higher scores in future Test-Comp competitions.


4 Setup and Configuration 


The steps to configure and run TracerX are similar to those for KLEE [5], with some
extra command-line arguments. The argument -solver-backend=z3 should be
provided to run TracerX with Deletion Interpolation. In addition, the -wp-interpolant
option is required to invoke WP Interpolation. For detailed information, please
see the integrated --help option.


5 Software Project and Contributors 


Information about TracerX, together with a self-contained binary, is publicly available at
https://tracer-x.github.io/. The source code can also be accessed on GitHub.
The authors of this paper and other colleagues have contributed to and developed
TracerX at NUS, Singapore. The authors of this paper acknowledge the direct and
indirect support of their students, former researchers, and colleagues.


6 Data-Availability Statement


The binary artifacts of TracerX with Deletion Interpolation and Weakest Precondition
Interpolation used in Test-Comp 2024 are publicly available at Zenodo [2]
and [3], respectively. Also, Test-Comp 2024 [1] provides all the necessary scripts,
benchmarks, and tool binaries to reproduce the competition’s results.


Abstract. We introduce ULTIMATE TESTGEN, a novel tool for automatic
test-case generation. Like many other test-case generators, ULTIMATE TESTGEN
builds on verification technology, i.e., it checks the (un)reachability
of test goals and generates test cases from counterexamples. In contrast
to existing tools, it applies trace abstraction, an automata-theoretic
approach to software model checking, which is implemented in the
successful verifier ULTIMATE AUTOMIZER. To avoid that the same test goal is
reached again, ULTIMATE TESTGEN extends the automata-theoretic model
checking approach with error automata.


Keywords: ULTIMATE AUTOMIZER · Test-case generation · Software testing
· Test coverage · Software model checking · Automata


1 Test-Generation Approach 


Verification technology has been successfully used in the past to automatically
generate test cases [12,14,7,1]. Most existing approaches follow a similar principle.
Mainly, they perceive reaching an (uncovered) test goal as a property
violation and construct test cases from counterexamples [6]. To build a test 
suite, they repeatedly check the reachability of still uncovered goals and prove 
their unreachability or generate test cases from counterexamples that testify the 
reachability of (uncovered) test goals. To improve the performance of the reach- 
ability analysis after detecting the reachability of a test goal, many approaches 
reuse previous information, e.g., continue the reachability analysis but exclude 
property violations caused by already covered test goals. Also, our new test-case 
generator ULTIMATE TESTGEN, which is implemented in Java, follows this basic 
principle. 

To analyze the reachability of test goals, ULTIMATE TESTGEN relies on trace
abstraction [11], an automata-theoretic approach to software model checking,
which performs counterexample-guided abstraction refinement (CEGAR) [9] and
which is implemented in ULTIMATE AUTOMIZER. Figure 1 shows the overview of
the test-case generation process performed by ULTIMATE TESTGEN. Components
highlighted in gray are added to the verification process of ULTIMATE AUTOMIZER
and enable test-case generation.


[Figure: pipeline from the program and coverage property through Test Goal Encoder, Automaton Translator, emptiness analysis, Feasibility Check (SAT: model, UNSAT: proof), Test-Case Exporter, Interpolant Automaton Generation, Error Automaton Generator, and Refinement, producing a test suite]

Fig. 1. Overview of the test-case generation approach of ULTIMATE TESTGEN


* Jury Member: Max Barth

© The Author(s) 2024
D. Beyer and A. Cavalcanti (Eds.): FASE 2024, LNCS 14573, pp. 326-330, 2024.
https://doi.org/10.1007/978-3-031-57259-3_20

The test-case generation process starts with the encoding of the test goals
into the program. To this end, we insert an assert(false); statement after each
test goal (either a branch or a call to reach_error()). Thereafter, we translate
the program with the assertions into an automaton A, which becomes the initial
abstraction. This initial abstraction represents all possible counterexamples,
i.e., the initial automaton accepts a syntactical program path iff it reaches an
assert statement (i.e., a violation). Next, we iteratively refine the automaton
abstraction until it becomes empty.

If the abstraction still accepts a counterexample path, we select an arbitrary
counterexample path $\pi$ from the abstraction and check its feasibility. To check
the feasibility of $\pi$, ULTIMATE TESTGEN encodes the path into a formula and
checks its satisfiability with an SMT solver. ULTIMATE TESTGEN relies on the
SMT solvers Z3 [13], CVC4 [3], and MathSAT5 [8]. However, during the check
we must ensure that an assert statement introduced to cover an earlier test goal
does not prohibit reaching later test goals. Therefore, the feasibility check ignores
the assert statements added during test goal encoding.

If the counterexample is spurious, i.e., the formula is unsatisfiable, we use
the proof of unsatisfiability to generate an interpolant automaton [10]. The
interpolant automaton accepts the counterexample path $\pi$ and other
(counterexample) paths that are infeasible for a similar reason. We use the interpolant
automaton to refine the abstraction and, thus, exclude infeasible paths, which
are accepted by the interpolant automaton, from the counterexample search.

If the counterexample is feasible, i.e., the formula is satisfiable, we generate
a test case from a model of the formula [6]. To this end, we identify the calls to
the __VERIFIER_nondet functions and retrieve their values from the model. Then,
we export the identified values into a test case in the exchange format used by
Test-Comp [5]. The values are exported in the same order as their corresponding
calls occur in the counterexample path $\pi$. In addition, we generate an error
automaton that accepts all counterexample paths that end in the same test
goal as the current counterexample $\pi$. We use the error automaton to refine the
abstraction and exclude paths from the counterexample search that reach test
goals that are already covered.
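
The export step can be sketched as follows. This is an illustration only and not the exporter of ULTIMATE TESTGEN (which is implemented in Java): it assumes that a test case is an XML document with one `input` element per nondeterministic value inside a `testcase` element, and it omits the XML declaration and DOCTYPE header of the exchange format. The function name `export_test_case` is hypothetical.

```python
import xml.etree.ElementTree as ET


def export_test_case(nondet_values):
    """Serialize the values of the nondeterministic calls as one test case.
    The values must be given in the order in which the calls occur on the
    counterexample path."""
    testcase = ET.Element("testcase")
    for value in nondet_values:
        ET.SubElement(testcase, "input").text = str(value)
    return ET.tostring(testcase, encoding="unicode")


# Values retrieved from the model, in path order.
print(export_test_case([5, -1, 0]))
# <testcase><input>5</input><input>-1</input><input>0</input></testcase>
```

Preserving the path order matters because a test harness replays the inputs sequentially, one per call to a nondeterministic input function.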

The last step is the refinement of the abstraction A. This step excludes the
paths determined to be irrelevant because they are known to be infeasible or do not
reach uncovered test goals. To this end, we subtract the interpolant automaton
or the error automaton, respectively, from the existing abstraction. Hence, each
step ensures that the abstraction considered in the next step accepts fewer
counterexample paths and, thus, guarantees progress of the test-case generation.
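
The overall loop can be summarized by the following toy sketch. It is a strongly simplified model under stated assumptions and not the algorithm as implemented in ULTIMATE TESTGEN: the abstraction is a finite set of syntactic paths instead of an automaton, a brute-force search over a small input domain stands in for the SMT solver, and an infeasible path is removed on its own, without the generalization that an interpolant automaton provides. All names are hypothetical.

```python
from itertools import product

# Branch conditions of a toy program with two integer inputs a and b.
CHECKS = {
    "a > 0": lambda a, b: a > 0,
    "a <= 0": lambda a, b: a <= 0,
    "a > 5": lambda a, b: a > 5,
    "b > a": lambda a, b: b > a,
    "b <= a": lambda a, b: b <= a,
}

# Initial abstraction: every syntactic path (conditions, reached test goal).
INITIAL = {
    (("a > 0", "b > a"), "goal1"),
    (("a <= 0", "b > a"), "goal1"),
    (("a > 0", "b <= a"), "goal2"),
    (("a <= 0", "a > 5"), "goal3"),   # infeasible path
}


def find_model(conditions):
    """Stand-in for the SMT solver: search a small domain for inputs that
    satisfy all conditions of the path; None means infeasible here."""
    for a, b in product(range(-3, 4), repeat=2):
        if all(CHECKS[c](a, b) for c in conditions):
            return (a, b)
    return None


def generate_tests(abstraction):
    abstraction = set(abstraction)
    tests = {}
    while abstraction:                    # stop when the abstraction is empty
        path = min(abstraction)           # deterministic choice of some path
        conditions, goal = path
        model = find_model(conditions)
        if model is None:
            removed = {path}              # role of the interpolant automaton
        else:
            tests[goal] = model           # test case from the model
            removed = {p for p in abstraction if p[1] == goal}  # error automaton
        abstraction -= removed            # refinement by subtraction
    return tests


tests = generate_tests(INITIAL)
print(sorted(tests))  # ['goal1', 'goal2']
```

Each iteration removes at least the selected path, so the loop terminates on a finite set of paths. When it does, every goal without a test case (goal3 in this example) has no feasible path left.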


2 Discussion of Strengths and Weaknesses 


For a comparison of ULTIMATE TESTGEN with the other participants of Test- 
Comp 2024, we refer to the competition report [5]. 

ULTIMATE TESTGEN checks the reachability of every test goal and generates 
a test case for every goal that it proved reachable. Due to this goal-oriented pro- 
cedure, it creates relatively small test suites. In addition, if ULTIMATE TESTGEN 
completes the test-case generation process (i.e., result done), we can confidently 
determine that any test goal not addressed by a test case is indeed unreachable. 

Nevertheless, proving the reachability of certain test goals can be hard and
requires expensive SMT solver calls. When studying the results for the category
cover-error, we observe that ULTIMATE TESTGEN runs out of resources
(time or memory) for many software systems tasks as well as tasks in the
categories XCSP, Sequentialized, ProductLines, and ECA. In addition to the resource
issue, we observe that sometimes our tests are not confirmed by the validator,
which seems to be caused by a bug in the translation of the counterexamples into
test cases. Still, there also exist categories like loops, heap, arrays, and fuzzle in
which ULTIMATE TESTGEN performs rather well.

Looking at the cover-branches category, we observe that for many software
systems tasks as well as for certain float tasks, we already fail to construct the
automaton from the program because required C features are not yet supported
by the program-to-automaton translation. In these cases, the test-case generation
procedure does not even start. In addition, ULTIMATE TESTGEN has problems
detecting the feasibility of error traces for Linux device driver tasks because
large string literals are not precisely encoded. For other task categories like AWS,
Sequentialized, ProductLines, Hardware, Fuzzle, ECA, and Combinations,
we observe that reaching the test goals is expensive and ULTIMATE TESTGEN
runs out of resources (time, memory) before covering a significant number of
test goals. While we have seen the resource issue for the cover-error category,
too, the Hardness tasks reveal another issue with our test-case exporter, which
makes ULTIMATE TESTGEN crash. The reason for the crash is that our test-case
exporter failed to translate values from the SMT-LIB [2] FloatingPoint format
back to certain C types such as ulong. Note that the C types float and double
were not an issue. Still, there exist task categories, e.g., loops, control-flow,
bitvectors, or XCSP, for which ULTIMATE TESTGEN performs well and achieves
high coverage values.


3 Setup and Configuration 


ULTIMATE TESTGEN is part of the Ultimate framework, which is licensed under
LGPLv3. To execute ULTIMATE TESTGEN in the version submitted to Test-Comp
2024 [4], one requires Java 11 and Python 3.6 and must invoke the following
command:


./Ultimate.py -spec <p> -file <f> -architecture <a> -full-output 


where <p> is a Test-Comp property file, <f> is an input C file, and <a> is the 
architecture (32bit or 64bit). During execution of the command, the generated 
tests are saved as .xml files in the exchange format for test cases required by 
Test-Comp [5]. In Test-Comp 2024, we use the above command to participate 
with ULTIMATE TESTGEN in both Test-Comp categories: cover-error (i.e., bug 
finding by covering the call to reach_error) and cover-branches (i.e., code 
coverage). 


Data Availability. The Test-Comp 2024 version of ULTIMATE TESTGEN is available
online on Zenodo [4] and on GitHub. Its corresponding benchmark definition
file is available on GitLab.